System and method of monitoring and measuring performance relative to expected performance characteristics for applications and software architecture hosted by an IaaS provider

ABSTRACT

The present disclosure is directed to a system for monitoring and analyzing operation of a widely distributed service operated by an Infrastructure-as-a-Service (IaaS) tenant but deployed on a set of virtual resources controlled by an independent IaaS provider. The set of virtual resources provided to the IaaS tenant by the IaaS provider is hosted on a set of physical resources selected by the IaaS provider, and both the set of virtual resources and the set of physical resources can change rapidly in both size and composition (i.e., the resources are “ephemeral”). Although the monitoring system may not have visibility into the composition, configuration, location, or any other information regarding the set of physical resources, the monitoring system can evaluate the performance of the virtual resources and infer that a virtual resource within the set of virtual resources may be hosted on at least one physical resource that is underperforming.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of priority under 35 U.S.C. §119(e) to U.S. Provisional Application No. 61/879,002, filed Sep. 17, 2013, which is hereby incorporated by reference in its entirety.

This application is related to the following references:

U.S. patent application Ser. No. (TBA), filed concurrently herewith and entitled “SYSTEM AND METHOD OF ALERTING ON EPHEMERAL RESOURCES FROM AN IAAS PROVIDER,” with Attorney Docket No. 2209392.125;

U.S. patent application Ser. No. (TBA), filed concurrently herewith and entitled “SYSTEM AND METHOD OF SEMANTICALLY MODELLING AND MONITORING APPLICATIONS AND SOFTWARE ARCHITECTURE HOSTED BY AN IAAS PROVIDER,” with Attorney Docket No. 2209392.126;

U.S. patent application Ser. No. (TBA), filed concurrently herewith and entitled “SYSTEM AND METHOD OF ADAPTIVELY AND DYNAMICALLY MODELLING AND MONITORING APPLICATIONS AND SOFTWARE ARCHITECTURE HOSTED BY AN IAAS PROVIDER,” with Attorney Docket No. 2209392.127; and

U.S. patent application Ser. No. (TBA), filed concurrently herewith and entitled “SYSTEM AND METHOD OF MONITORING AND MEASURING CLUSTER PERFORMANCE HOSTED BY AN IAAS PROVIDER BY MEANS OF OUTLIER DETECTION,” with Attorney Docket No. 2209392.129.

TECHNICAL FIELD

In general, the present disclosure relates to methods, apparatuses and systems for measuring and monitoring resources and metrics related to cloud-based applications. Specifically, the present disclosure relates to systems, methods, and non-transitory computer program products for monitoring and measuring performance relative to expected performance characteristics for applications and software architecture hosted by an Infrastructure-as-a-Service (IaaS) provider.

BACKGROUND

Instead of being provided from a centralized set of infrastructure owned and operated by a single entity, data services and applications being offered on the Internet today are increasingly being hosted in a virtualized, multi-tenant infrastructure environment. For example, whereas a photo sharing service might have formerly been hosted on a set of servers and databases operated by the owner or operator of the photo sharing service, today that same photo sharing service might be hosted on a set of “virtual” infrastructure operated by third party providers, such as Amazon Web Services, Google Compute Engine, OpenStack, or Rackspace Cloud. In other words, data services and applications might now be hosted “in the cloud.” Such “virtual” infrastructure or resources can include virtual web servers, load balancers, and databases hosted as software instances running on separate hardware.

There are several reasons commonly cited for building applications for the cloud. Using cloud resources reduces the time required to provision a new virtual infrastructure to effectively zero. Traditionally, it took weeks or even months to acquire, install, configure, network, and image new hardware. Cloud users can launch new instances from an infrastructure provider in a matter of minutes. Another key reason for building for the cloud is the elastic nature of the cloud. When a customer needs more virtual resources, they request more to be provisioned for them. When they are done with the resources, they return them to the provider. The provider charges customers for resources only when they are in use (typically on an hourly basis). Elasticity allows customers to adjust the number of resources they use (and pay for) to match the load on the application. The load on the application may vary according to trends that are short (hourly or daily cycles) or long (growth of the business over months).

Accordingly, there is a need to provide a system which can collect data from virtual infrastructure, monitor and process the data to identify anomalies and potential areas of concern, and report the results to operators of a data service and/or application hosted on the cloud. However, the benefits of the cloud are some of the same things that make the cloud hard to monitor. First, with virtualized resources, a description of a resource and its behavior is not available via a single source. For example, the infrastructure provider can use Application Program Interfaces (APIs) to provide metadata about a virtual resource within the virtual environment (e.g., where it is located, what type of resource it is, the capacity allocated to the resource, etc.). However, information about what is running inside the virtual container is only available from within that container; such information cannot be obtained by querying the infrastructure provider's APIs. Second, because of the elastic nature of the cloud, the configuration of a customer application can change very quickly. It is not uncommon for the number (and therefore, aggregate capacity) of resources used by a customer to fluctuate by hundreds per day to accommodate diurnal patterns, or by thousands of resources in a matter of weeks to track business growth. These changes in resources can be driven by demand for the customer's application (e.g., more resources are provisioned if more people are using the application), supply of resources (e.g., a customer may request that additional resources be provisioned only if the price of using these resources falls below a certain threshold), and/or scheduled patterns (e.g., additional resources are provided during expected peak demand times during the day). The monitoring tool must be able to operate within a dynamic environment that is changing faster than can be tracked by human operators.

SUMMARY OF THE INVENTION

In accordance with the disclosed subject matter, systems, methods, and non-transitory computer program products are provided for monitoring and measuring performance relative to expected performance characteristics for applications and software architecture hosted by an IaaS provider.

Certain embodiments include systems for determining that a virtual resource within a set of virtual resources may be hosted on at least one physical resource that is underperforming. The set of virtual resources may be provided by an independent Infrastructure-as-a-Service (IaaS) provider to an IaaS tenant for operating a widely distributed service. The virtual resources may be hosted on a set of physical resources that may be geographically dispersed, part of different communication networks, or disjoint. The IaaS provider may be responsible for selection of the set of physical resources and an operational capacity of the set of virtual resources may change substantially and rapidly. The IaaS tenant may have no direct control over and limited visibility into the selection of the set of physical resources. The system may include a data gateway and an analysis module. The data gateway may be configured to receive CPU utilization information related to the operation of the set of virtual resources. The analysis module may be configured to determine that a candidate virtual resource may be hosted on at least one underperforming physical resource based on at least one of: (i) a comparison of CPU utilization of the candidate virtual resource with CPU utilization of other virtual resources within the set of virtual resources that are expected to perform similarly, (ii) a comparison of present CPU utilization of the candidate virtual resource with historical CPU utilization of the candidate virtual resource, and (iii) a comparison of CPU utilization of the candidate virtual resource with preconfigured thresholds.
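
By way of illustration only, the three comparisons (i) through (iii) could be sketched as follows. The function name, the use of an average CPU steal percentage as the compared quantity, and all numeric factors and thresholds are assumptions chosen for this sketch and are not prescribed by the present disclosure.

```python
from statistics import mean


def underperformance_reasons(candidate_steal, peer_steal, history_steal,
                             max_steal=10.0, peer_factor=2.0,
                             history_factor=2.0, noise_floor=1.0):
    """Return which comparisons flag the candidate virtual resource.

    All inputs are average CPU steal percentages over one time interval.
    The factors, threshold and noise floor are hypothetical defaults.
    """
    reasons = []
    # Ignore very small steal values so that tiny absolute differences
    # do not count as a meaningful deviation.
    if candidate_steal <= noise_floor:
        return reasons
    # (i) comparison with resources expected to perform similarly
    if peer_steal and candidate_steal > peer_factor * mean(peer_steal):
        reasons.append("peers")
    # (ii) comparison with the candidate's own historical values
    if history_steal and candidate_steal > history_factor * mean(history_steal):
        reasons.append("history")
    # (iii) comparison with a preconfigured threshold
    if candidate_steal > max_steal:
        reasons.append("threshold")
    return reasons
```

An empty result means that none of the comparisons suggests an underperforming physical resource; a non-empty result only indicates that the candidate may be hosted on one.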

The embodiments described herein can include additional aspects. For example, the analysis module may be further configured to suggest that the IaaS tenant terminate and relaunch the candidate virtual resource that is determined to be hosted on at least one underperforming physical resource so that the candidate virtual resource may be reassigned to another physical resource by the IaaS provider. The CPU utilization information may include a CPU steal metric, a CPU utilization metric, or a CPU idle metric. The comparison of CPU utilization of the candidate virtual resource with CPU utilization of other virtual resources that are expected to perform similarly may include a comparison of average CPU steal metrics during a predefined time interval. The preconfigured thresholds may be based on an expected level of performance for the resources provided by the IaaS provider to the IaaS tenant. The system may further include an infrastructure platform collector configured to query Application Program Interfaces (APIs) defined by the IaaS provider to collect infrastructure metadata characterizing the set of resources, and to detect when queries to APIs result in an error condition. The analysis module may be configured to determine that a candidate virtual resource may be hosted on at least one underperforming physical resource based on the error condition. The error condition may further include an error code, incorrect metadata, or a delayed response.

BRIEF DESCRIPTION OF FIGURES

FIG. 1 is a block diagram showing the major components of the monitoring system, according to some embodiments.

FIG. 2 shows how the technology stack for a cloud-hosted application can be integrated with the presently disclosed monitoring system, according to some embodiments.

FIG. 3 shows data flows between collectors, data stores and analysis modules in the presently disclosed monitoring system, according to some embodiments.

FIG. 4A shows the batch analysis and reporting module of the presently disclosed monitoring system, according to some embodiments.

FIGS. 4B-4I show flowcharts depicting data analyses that can be performed by the batch analysis and reporting module, according to some embodiments.

FIG. 5 is a block diagram showing the operation of the Event Detection module, the Exception Monitoring module, and the Policy Analyzer module, according to some embodiments.

FIG. 5B is a flowchart depicting one example algorithm that can be used by Intelligent Change Detection analysis in the Exception Monitoring module, according to some embodiments.

FIG. 6 is a block diagram showing the operation of the User Interface, according to some embodiments.

FIG. 7 is a block diagram showing the operation of the Notification Gateway, according to some embodiments.

FIG. 8 is a block diagram showing how Infrastructure Platform Collector can be scaled to collect data from large numbers of Provider APIs, according to some embodiments.

FIG. 9 is a block diagram showing how the Data Gateway can be scaled to collect data from large numbers of data collectors, according to some embodiments.

DESCRIPTION

The present monitoring system can evaluate performance of virtual resources used by a widely distributed service that is operated by an Infrastructure-as-a-Service (IaaS) tenant but deployed on a set of virtual resources controlled by an independent IaaS provider. Although the monitoring system may not have visibility into the composition, configuration, location, or any other information regarding the set of physical resources, the present system is able to evaluate the performance of the virtual resources and infer that a virtual resource within the set of virtual resources may be hosted on at least one physical resource that is underperforming.

The monitoring system can implement a service that can be used by other companies to monitor the performance of their systems, virtual infrastructure, hosted services, and applications. The monitoring system can be hosted in a virtualized multi-tenant infrastructure environment. The system collects inventory and monitoring data from the customer's environment via several methods, analyzes that data, identifies potential issues, stores the data for future retrieval, and notifies the customer when issues occur (or are likely to occur in the future). Customers access the monitoring system user interface via a web browser using standard Internet protocols. The monitoring system notifies customers of issues via electronic mail, SMS, and calls to APIs provided by third party services.

In one embodiment, an example of a customer would be a photo sharing service that operates on top of virtual “cloud” infrastructure. End users leverage the photo sharing service to upload, store, edit, and share digital photographs. Main components of such an application would likely include a set of cloud servers supporting photo upload, a hosted storage subsystem for storing the photos, and a set of cloud servers supporting photo retrieval. Other supporting components could include, for example, various databases (e.g., for authentication, preferences, indexes, etc.) and customer-developed processing code (e.g., to resize photos).

The photo sharing company can configure and use the presently disclosed monitoring system to monitor each of the key components of its photo sharing application (e.g., cloud servers, hosted storage subsystems, databases, application code, and other components). Once monitoring is configured, the monitoring system can begin collecting and analyzing that information and notifying the customer of issues. Staff at the photo sharing company would log on to the monitoring system service from their web browsers in order to review the configuration and key metrics related to their environments.

It is to be understood that although the above discussion uses a photo sharing company as an example, this example is meant to be illustrative only; the present disclosure is not limited to any particular application. Also, while the terms “customer” and “user” are both used in the present disclosure, it is to be understood that the two terms are to be considered interchangeable, and that the present disclosure is not limited to a commercial service-provider and customer application. For example, the present disclosure could be used by the same organization/enterprise, or by a governmental entity.

FIG. 1 is a block diagram showing the major components of the monitoring system, according to some embodiments. Specifically, FIG. 1 shows the major components of the system classified into four categories, as outlined in horizontal divisions. Collector modules 10 can capture data from customer environments and transmit that data to the monitoring system; these modules can include Infrastructure Platform Collector 301, System Data Collector 302, Application Data Collector 303, Log Data Collector 306, and Endpoint Monitoring Probes 307. Data modules 12 can provide storage for customer and application information; these modules can include Live Metric Database 401, Resource/Metadata Store 402, Metric Archive 403, Policy Database 404, Application Database 405, and Event Store 406. Analysis modules 14 can evaluate data collected from customer environments; these modules can include Application & Topology Discovery 501, Event Detection 502, Batch Analysis & Reporting 503, and Exception Monitoring 504. Interface modules 16 can serve as the user's mechanism for accessing the service; these modules can include User Interface 601, Notifier 602, and Automation 603. As used herein, “modules” can mean any of a hardware component, a software routine, and/or a data service. Each of the above-described modules will be described further in relation to FIGS. 2-9 below.

FIG. 2 shows how the technology stack for a cloud-hosted application can be integrated with the presently disclosed monitoring system in one embodiment. FIG. 2 shows a Monitoring System Environment 102 which communicates with Customer Environment 104. Customer Environment 104 can host a customer application, which can be composed of collections of App Services 112 (e.g., MongoDB, Apache, etc.), Hosted Services 108 (e.g., ELB, RDS, etc.), and custom Application Code 114. Hosted Services 108 can be well-known application building blocks such as databases and web servers. Custom Application Code 114 and App Services 112 can run atop a System/Operating System (System/OS) 110. System/OS 110 can be run in virtual infrastructure, such as that provided by an IaaS vendor, like Amazon Web Services, Google Compute Engine, OpenStack, or Rackspace Cloud. Monitoring System Environment 102 uses various integration points to interact with Customer Environment 104.

In Customer Environment 104, infrastructure Provider APIs 305 can be interfaces provided by a cloud provider to retrieve information about Infrastructure 100 that a customer is using from the cloud provider. Provider APIs 305 can provide an inventory of the virtual resources allocated to the customer. Because of the elastic nature of the cloud, the inventory may change on a minute-to-minute basis. It is important that Provider APIs 305 be queried frequently enough to observe fast changes in the infrastructure (e.g., every five minutes). Provider APIs 305 can return data about the behavior of the virtual infrastructure (e.g., CPU utilization, disk I/O rates). Provider APIs 305 can also provide data about hosted services (e.g., request rate on a virtual load balancer, query latency on a hosted database). In Monitoring System Environment 102, the monitoring system can use Infrastructure Platform Collector 301 to query Provider APIs 305 on behalf of the customer using a read-only role provided by the customer. The data returned by the Provider APIs 305 can then be normalized and forwarded to the Data Gateway 701, also in the Monitoring System Environment 102. Because these Provider APIs 305 are defined publicly, the monitoring system can fully understand the semantics of the data and corresponding metadata for all results.

To collect data at the operating system level (or metric data), the monitoring system can use a System Data Collector 302 or “agent” installed in each application instance to be monitored on the Customer Environment 104. For example, in one embodiment, the monitoring system can use the open source agent CollectD as the starting point for System Data Collector 302. System Data Collector 302 can also collect data from well-known software services (e.g., Apache or MongoDB). System Data Collector 302 can also use a plugin model to support services. After collecting data, System Data Collector 302 forwards the data to Data Gateway 701 hosted by the Monitoring System Environment 102. Data Gateway 701 and System Data Collector 302 can use SSL for security. System Data Collector 302 can also include with each message a shared secret, called an API key, to authenticate itself to Data Gateway 701, and/or use hashing for message integrity. Because data is collected by a known agent, the monitoring system understands the semantics of the data and metadata coming from the System Data Collector 302. Again, data at the operating system level can change rapidly on a minute-to-minute basis, and so System Data Collector 302 can be configured to collect this data at a rate fast enough to capture an expected rate of change in this data (e.g., every five minutes).
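
As a non-limiting sketch of how the shared secret and hashing might be combined, the example below computes a keyed hash (HMAC-SHA256) over each message body using the API key, so that the gateway can check both the sender and the integrity of the message. The choice of HMAC, the message layout, and the function names are assumptions made for this illustration; the present disclosure does not prescribe a particular scheme.

```python
import hashlib
import hmac
import json


def build_message(api_key, metrics, timestamp):
    """Agent side: serialize the metrics and attach a keyed digest.

    The field names "body" and "digest" are hypothetical.
    """
    body = json.dumps({"ts": timestamp, "metrics": metrics}, sort_keys=True)
    digest = hmac.new(api_key.encode(), body.encode(), hashlib.sha256).hexdigest()
    return {"body": body, "digest": digest}


def verify_message(api_key, message):
    """Gateway side: recompute the digest and compare in constant time."""
    expected = hmac.new(api_key.encode(), message["body"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["digest"])
```

In this sketch the API key never travels inside the message; in the alternative described above, where the key itself accompanies each message, transport over SSL would protect it in transit.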

Customer Environment 104 can also include custom Application Code 114, which can be configured to measure and report data that is important to the customer's applications via Application Data Collector 303. Application Code 114 sends any measurements that it wants to monitor with the monitoring service to Data Gateway 701; these measurements can also be sent using SSL, the API key, and hashing as described above. Application Data Collector 303 can be a simple language-specific library provided by the monitoring system or a third party to simplify the task of formatting and sending messages to the monitoring system. Alternatively, the customer may write software which serves as the Application Data Collector 303. Because custom measurements are fully defined by the custom applications, the monitoring system may not know how to interpret custom metrics.

Custom application code and services used by customers can generate large volumes of log data. The monitoring system can accept log data via Log Data Collector 306. Log Data Collector 306 in turn forwards the data to Data Gateway 701. Another source of log data is the infrastructure provider. To collect infrastructure provider logs, Log Data Collector 306 can use Provider APIs 305. Log Data Collector 306 can be configured to collect log data at a rate frequent enough to capture an expected rate of change in the log data, for example, every five minutes.

Applications to be monitored by the monitoring system are often exposed to their users via one or more HTTP endpoints. Endpoint Monitoring Probes 307 in the Monitoring System Environment 102 can monitor the health and performance of application endpoints from the perspective of the user. These probes can be located in several geographically distributed locations, for example, Europe, Asia, Africa, North America or South America. Metrics related to the availability (e.g., can the endpoint be contacted?), health (e.g., does the endpoint respond as expected?), and performance (e.g., what is the request latency for the endpoint?) of the customer's application as they appear from the perspective of an end user, among others, can be collected by Endpoint Monitoring Probes 307. Endpoint Monitoring Probes 307 can then forward these metrics to Data Gateway 701. Endpoint Monitoring Probes 307 can be configured to collect these metrics at a rate fast enough to capture an expected rate of change in the metrics, for example, every five minutes.
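
The three kinds of endpoint metrics named above (availability, health, and performance) could be gathered along the following lines. This is a minimal sketch: the function name, the returned field names, and the injected `fetch_status` callable (any function that takes a URL and returns an HTTP status code, raising `OSError` on a connection failure) are assumptions made for illustration.

```python
import time


def probe_endpoint(fetch_status, url, expected_status=200):
    """Probe one endpoint and report availability, health and latency.

    fetch_status is a hypothetical callable supplied by the caller; it
    performs the actual HTTP request and returns the status code.
    """
    started = time.monotonic()
    try:
        status = fetch_status(url)
    except OSError:
        # The endpoint could not be contacted at all.
        return {"available": False, "healthy": False, "latency_seconds": None}
    latency = time.monotonic() - started
    return {
        "available": True,                      # the endpoint was reachable
        "healthy": status == expected_status,   # it responded as expected
        "latency_seconds": latency,             # request latency as observed
    }
```

Passing the request function in as a parameter keeps the sketch independent of any particular HTTP client and allows the same logic to run from each probe location.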

Monitoring System Environment 102 can also include an Automation Engine 603 that supports the automation of certain tasks on behalf of the customer. For example, based on a condition or schedule, the monitoring system can initiate an action, such as rebooting an instance, provisioning a new instance, or adding disk capacity to a database. In this way, the monitoring system can cause tasks to be performed on resources in Customer Environment 104. The monitoring system can use Provider APIs 305 to modify the customer infrastructure in Customer Environment 104.

To initiate and control actions in Customer Environment 104, the monitoring system can be granted more privileges than the read-only permissions required to query APIs and collect data. If these privileges are granted, the monitoring system can initiate actions on the customer environment based on input from the customer received using the monitoring system's user interface. For example, when the customer clicks an icon in the monitoring system's user interface to snapshot a block device, the monitoring system can first verify that it has sufficient permissions from the customer and can then call Provider APIs 305 to start the action.

FIG. 3 shows data flows between collectors, data stores and analysis modules in the presently disclosed monitoring system, according to some embodiments. Specifically, FIG. 3 shows data flows into and within the monitoring service in more detail. The monitoring system can store different types of data. Each data type can be stored in its own data store. The system can also spread the data of a given type across several different data stores and even data store technologies.

Data Gateway 701 provides a single ingest service into which data from each of Infrastructure Platform Collector 301, System Data Collector 302, Application Data Collector 303, Log Data Collector 306 and Endpoint Monitoring Probes 307 can be fed. From there, data can be forwarded from Data Gateway 701 to the appropriate data store, as described below.

Live Metric Database 401 can enable rapid storage and retrieval of metrics and other data from the customer environment. The primary contents of the Live Metric Database 401 can be time series of individual metrics. The Database 401 can also store aggregations of the original time series, i.e., aggregated time series which can contain data rolled up into coarser time granularity for efficient retrieval and presentation of long time scales. If the monitoring system does not back up the Database 401, the monitoring system can instead rely on fault-tolerance of the Database 401 and an Archive 403 for disaster recovery. In one example embodiment, Live Metric Database 401 can be implemented using Cassandra.
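
The roll-up of an original time series into a coarser time granularity can be illustrated with the following sketch. Averaging within fixed-width buckets is only one possible aggregation, and the function name and the `(timestamp, value)` representation are assumptions made for this example.

```python
from collections import defaultdict


def roll_up(points, bucket_seconds):
    """Aggregate (unix_timestamp, value) pairs into fixed-width buckets.

    Returns a list of (bucket_start, average_value) pairs sorted by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of the bucket that contains it.
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, sum(values) / len(values))
            for start, values in sorted(buckets.items())]
```

For example, one-minute samples could be rolled up into five-minute or hourly averages so that a chart covering several months retrieves far fewer points than the raw series would require.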

Resource/Metadata Store 402 can enable rapid retrieval of resources, properties of resources, and topology information for the customer's application and infrastructure. Resource/Metadata Store 402 can be populated by infrastructure metadata collected from Provider APIs 305 via Infrastructure Platform Collector 301. This infrastructure metadata collected from Provider APIs 305 and stored in Resource/Metadata Store 402 can include (i) infrastructure-provider metadata expressed in a fixed format prescribed by the infrastructure provider characterizing the resources then being used by the customer's application (e.g., resource type, capacity, etc.), and (ii) operator metadata expressed in arbitrary text specific to, or perhaps supplied by, a particular customer that the customer uses to characterize the resources. This operator metadata can include customer-specific naming conventions for resources, or customer-specific codes or terms establishing security rules and policies (e.g., firewall rules), resource roles (e.g., web server, load balancer), geography (e.g., Asia, Europe), business function (e.g., Advertising, Customer Support), organizational business unit (e.g., widget 1, widget 2), etc. In short, operator metadata can include customer-specific text that reflects how the customer intuitively thinks about and organizes the resources in its infrastructure. In one embodiment, Store 402 can rely on the ElasticSearch distributed search and analytics engine and can be scaled horizontally by adding additional nodes.

The infrastructure metadata (including one or both of the infrastructure-provider metadata and the operator metadata) can be thought of as including information about an actual state of a virtual resource, or an anticipated state of a virtual resource. An actual “state” of a resource can include information regarding the resource's type (e.g., AWS t1-micro vs. AWS m1-small resource as used in Amazon Web Services (AWS)), the resource's role (e.g., web server vs. load balancer), or an operational status of a resource. The operational status of a resource can include whether the resource is “terminated,” meaning that the resource has been de-allocated from the customer's application and is no longer available for use by the customer, or whether the resource is “stopped,” meaning that the resource has been temporarily suspended and the customer is being charged a reduced rate, but that the resource can be restarted with minimal delay time. If a resource has been terminated or stopped, the infrastructure metadata can also include information regarding whether this termination or stoppage is prompted by a request from the customer (e.g., because the customer no longer needs the resource), or by the infrastructure provider (e.g., because the resources are needed for other uses, or because the resources have crashed). An anticipated “state” of a resource can include information regarding an expected availability of the virtual resource in the future (e.g., a notice by the infrastructure provider that the resource will be decommissioned as of a certain date), or information regarding whether the resource is scheduled for termination and/or stopping at some point in the future.

The difference between Live Metric Database 401 and Resource/Metadata Store 402 is the type of data stored about the customer's application/infrastructure. The Resource/Metadata Store 402 can record metadata about the resources in the customer's system, such as instance type or location. On the other hand, Live Metric Database 401 can record the instantaneous behavior of the customer's system, like memory usage, network bandwidth usage, request latency, etc. Resource/Metadata Store 402 can be populated by information from Provider APIs 305, while the Live Metric Database 401 can be populated by information from System Data Collector 302, Application Data Collector 303, Log Data Collector 306, and/or Endpoint Monitoring Probes 307.

Event Store 406 can be used to store events. Events can be a type of data that describes discrete occurrences or changes in the system. Events can cover much of the activity of the customer application that cannot be captured by time series metrics or measurements. In one example embodiment, the monitoring system can store events in ElasticSearch for fast querying and filtering.

Metric Archive 403 can enable long-term preservation and batch analysis of customer metric data. It can also serve as a foundation for disaster recovery. Metric data can be stored in raw, uncompressed, unencrypted format in an object storage system. In one example embodiment, metric data can be stored using the JSON encoding format.

Policy Database 404 can include both customer-defined and default logic against which Exception Monitoring System 504 (described in further detail below in relation to FIG. 5) can evaluate metrics and events. Customer-defined policies are created and modified via the monitoring system's user interface. In one example embodiment, the monitoring system can use a hosted version of MySQL for Policy Database 404.

Application Database 405 can store customer-specific preferences and configurations defined by the customer. These preferences and configurations can include things such as the definition of groups (and subgroups and clusters). Database 405 can also record “dashboards” that the customer defines, which are preconfigured displays of metrics and display settings that are of particular interest to the customer. Database 405 can also record the customer's notification configurations and preferences. In one example embodiment, the monitoring system can use a hosted version of MySQL for this store.

The monitoring system can also perform a number of analyses on the data collected from a customer's infrastructure. These analyses are shown at the bottom of FIG. 3.

Application and Topology Discovery 501 can analyze data and metadata from the customer's application to establish a service architecture of the roles and relationships between components of the customer's environment. This service architecture of established relationships can be used to set appropriate defaults for the customer's monitoring system settings and improve the relevance of the monitoring system's analysis. This service architecture can approximate how the customer intuitively thinks about and organizes the resources in its infrastructure, as illustrated in the above-discussed examples. The customer can also declare their service architecture to improve on what the monitoring system discovers. The most common relationships between components of the customer's environment are “groups” and “clusters”.

A “group” is a set of resources (possibly of different types, e.g., servers, databases, load balancers, data services, etc.) that are considered as a single unit. Customers can define groups so that resources in the monitoring system are organized in the same way as in the customer's own organization. Customers can define groups, for example, based on deployment type (Production versus Staging), architectural subsystem (such as media transcoder or ad server), geographic location (US-east or Europe), or organizational boundaries (Finance or HR).

A “cluster” is a special type of group in which all resources are expected to behave in a similar way. When clusters are defined, the monitoring system can perform additional types of analysis. A “Production” group would likely not be a cluster because it would include a variety of resources performing a variety of functions. On the other hand, an “Ad Server” group would be more likely to be a cluster because it is probably a set of instances (e.g., a set of web servers) all performing the same function with roughly similar workloads in the customer's application. Defining a “group” as a “cluster” can enable special types of analysis: for example, since all members of a cluster are expected to behave similarly, the monitoring system can be configured to detect when one member of a cluster is not behaving in the same way as its peers, and notify the customer accordingly.

Groups and clusters can be nested arbitrarily. For example, a “Production” group might have child groups for “Ad Servers” and “Billing System”. The “Ad Servers” group may be defined to be a cluster. Within the “Ad Servers” cluster, there may be an additional categorization of “US” and “Europe” clusters that describe where they are hosted. Under the “Billing System” group, there may be subgroups for “USD” and “Euro” to reflect the currency type that different components are designed to support. A plurality of clusters can also be nested within a group.
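By way of illustration only, the nesting described above could be represented with a simple recursive data structure. The following is a minimal sketch, not the disclosed implementation; the `Group` class, its field names, and the resource identifiers (e.g., `ad-us-1`) are hypothetical and chosen only to mirror the “Production” example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Group:
    """Hypothetical model of a group; a cluster is a group with is_cluster=True."""
    name: str
    is_cluster: bool = False
    resources: List[str] = field(default_factory=list)
    children: List["Group"] = field(default_factory=list)

    def all_resources(self) -> List[str]:
        """Return resources of this group and of every nested group, depth first."""
        found = list(self.resources)
        for child in self.children:
            found.extend(child.all_resources())
        return found

    def clusters(self) -> List["Group"]:
        """Return every group in this subtree that is marked as a cluster."""
        found = [self] if self.is_cluster else []
        for child in self.children:
            found.extend(child.clusters())
        return found


# Hypothetical resource names mirroring the "Production" example above.
production = Group("Production", children=[
    Group("Ad Servers", is_cluster=True, children=[
        Group("US", is_cluster=True, resources=["ad-us-1", "ad-us-2"]),
        Group("Europe", is_cluster=True, resources=["ad-eu-1"]),
    ]),
    Group("Billing System", children=[
        Group("USD", resources=["bill-usd-1"]),
        Group("Euro", resources=["bill-eur-1"]),
    ]),
])

print(production.all_resources())
print([g.name for g in production.clusters()])
```

In this sketch a cluster is a flag on a group rather than a separate type, which keeps arbitrary nesting of groups and clusters uniform.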

Application and Topology Discovery 501 can use a number of techniques to automatically infer a service architecture of the customer's application based solely on the infrastructure metadata, without human operator modeling input or information regarding the actual physical network connectivity between resources. In one embodiment, Application and Topology Discovery 501 can retrieve infrastructure metadata from Resource/Metadata Store 402 and search the operator metadata embedded in this infrastructure metadata for patterns in order to identify “groups” of resources. For example, naming conventions in the operator metadata (e.g., “Mongo-1”, “Mongo-2”, “Mongo-3”) can indicate a group of similar resources. Security group (i.e., firewall) rules can also show relationships between different components. These security group rules can define how other instances or resources may contact the target instance or resource. For example, this can be defined in terms of which network ports are open to the network. Alternatively, security group rules can define what portions of the network can contact the target resource (e.g., by IP address). In yet another alternative, security group rules can be defined in terms of what other security groups can contact the target resource. If different resource instances have the same security group (firewall rules), it is likely that these resource instances serve a similar purpose and should be grouped together. Application and Topology Discovery 501 can also use deployment patterns to infer service architecture. For example, the set of instances behind, or being serviced by, a load balancer can be expected to share the same role and configuration, and therefore should be grouped together. The deployment pattern can also be used to infer hierarchies: the same set of instances behind the load balancer could be nested within a larger group.

In addition, Application and Topology Discovery 501 can use the unique fingerprint of well-known technologies and services (such as MongoDB, MySQL, Apache, etc.) to identify commonly-used server software and put resources which use the same software into the same group or cluster. These unique fingerprints can be a well-known port on the resource that is open to the network. For example, HTTP traffic typically runs on TCP port 80, or sometimes on port 8080. HTTPS usually runs on TCP port 443. Alternatively, the monitoring system can probe ports of resources and try to deduce the resource type from the response it receives. Some responses from resources can include information regarding what server/version the resource is running. In yet another alternative, the monitoring system could look at traffic originating from a target resource. By observing the requests or queries originating from the target resource, the monitoring system might be able to deduce something about the target resource. More generally, the problem of inferring the service architecture can be viewed as a clustering problem. Given metadata (including infrastructure metadata and operator metadata) for a collection of resources, Application and Topology Discovery 501 can search for clusters of similar resources as defined by similarity in metadata. One possible approach to this problem is the k-means clustering algorithm.
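As one illustrative possibility, the naming-convention heuristic mentioned above (e.g., “Mongo-1”, “Mongo-2”, “Mongo-3”) could be sketched as follows. This is a simplified example under stated assumptions: the function name `group_by_name_prefix` and the rule that a name consists of a prefix followed by an optional separator and a numeric suffix are hypothetical, and real operator metadata may follow other conventions.

```python
import re
from collections import defaultdict


def group_by_name_prefix(names):
    """Group resource names that share a prefix before a trailing numeric suffix.

    Assumption: names look like "<prefix>-<number>" or "<prefix>_<number>";
    a name without a numeric suffix forms a group of its own.
    """
    groups = defaultdict(list)
    for name in names:
        match = re.fullmatch(r"(.+?)[-_]?(\d+)", name)
        prefix = match.group(1) if match else name
        groups[prefix].append(name)
    return dict(groups)


# Hypothetical operator-assigned names.
names = ["Mongo-1", "Mongo-2", "Mongo-3", "web-01", "web-02", "billing"]
groups = group_by_name_prefix(names)
print(groups)
```

A heuristic of this kind would presumably be only one signal among several; security group rules, deployment patterns, and software fingerprints could be combined with it before a group is proposed to the customer.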

Groups can further be classified as clusters based on the dynamic performance of resources in the group. The Application and Topology Discovery 501 retrieves performance related data (e.g., memory usage, I/O patterns, and network usage) from Live Metric Database 401. Approaches for performing cluster outlier detection analysis (described later in relation to Analysis Engine 5032) can be modified to detect when resources are behaving similarly instead of when resources are behaving differently. If the resources are of the same type, and they are behaving similarly, these resources can be classified as a cluster.

Because of the dynamic nature of the cloud and the iterative design approach employed by many development teams, the service architecture detected via Application and Topology Discovery 501 can itself be dynamic. The process can detect small incremental changes in the infrastructure (such as when virtual resources are provisioned or retired) and more significant changes (such as when the service architecture of the application changes). To handle these small and large changes, the Application and Topology Discovery process 501 is run periodically. Changes in the service architecture are often related to changes in resource metadata (as recorded in Resource/Metadata Store 402) and changes in resource inventory and status (as recorded in Event Store 406).

Event Detection 502, Batch Analysis and Reporting 503, and Exception Monitoring 504 are other analysis modules. Batch Analysis and Reporting 503 is shown in more detail in FIG. 4. Event Detection 502 and Exception Monitoring 504 are shown in more detail in FIG. 5.

FIG. 4A shows the batch analysis and reporting module of the presently disclosed monitoring system, according to some embodiments. Specifically, FIG. 4A shows the Batch Analysis and Reporting analysis module in more detail. The main component of this module is the Analysis Engine 5032. Analysis Engine 5032 can be responsible for directing the analysis functionality. Analysis Engine 5032 can pull data from three sources: Live Metric Database 401, Resource/Metadata Store 402, and Event Store 406. Live Metric Database 401, Resource/Metadata Store 402, and Event Store 406 were described previously. As discussed above, Live Metric Database 401 can provide a combination of historical performance data and current performance from the customer environment. Resource/Metadata Store 402 can provide infrastructure inventory and topology.

Analysis Engine 5032 can also rely on Analysis Config Library 5031, which can include definitions of metrics, performance patterns, best practices, and failure modes used by Analysis Engine 5032 to perform targeted analysis of the customer environment. Analysis Config Library 5031 can serve as a repository that records how general analysis techniques are to be applied to specific resource types. For example, it can record which metrics are important for analysis, describe thresholds for absolute and relative magnitude for changes to be interesting, and record rules for classifying the state of a resource based on the resource's measured behavior.

In some embodiments, Analysis Config Library 5031 can be pre-populated based on the best practices from experienced practitioners of cloud systems. For example, the library can define model analysis configurations for various combinations of resource type, service type, and/or system/service metric that the customer is expected to find most relevant and interesting. For instance, the library can define model analysis configurations for when the customer is interested in monitoring request count on a load balancer or messages in a queue service. Such a model analysis configuration can consist of the set of analyses that should be run for that combination of resource type, service type, and/or system/service metric, configuration parameters for the algorithms, and semantic data about the metric. An example of a configuration parameter is the statistical confidence interval for one of the statistical steps of the algorithm. An example of semantic data is the instruction that the metric “resource exhaustion” is only of interest if it is predicted to happen within 3 days, or that an anomaly is only interesting if the absolute magnitude is at least 1 MB/s. By pre-populating Analysis Config Library 5031 with best practices from experienced practitioners, the monitoring system can suggest to new customers the most relevant and intuitive configurations for monitoring and displaying the appropriate data for their application. While customers can always customize what metrics they want to measure and how they want to view them, the Analysis Config Library 5031's pre-populated model configurations can make it significantly easier for customers to quickly customize the monitoring system for their needs. The functionality of Analysis Config Library 5031's model analysis configurations is discussed in more detail with regard to User Interface 601 and FIG. 6 below.

Analysis Engine 5032 can perform several different analyses (5042, 5044, 5046, etc.) to identify risks to the customer application and predict problems. Some of these analyses are described in more detail below. Most analysis techniques analyze data related to a single customer. In some embodiments, however, the Analysis Engine 5032 can also perform some analyses across multiple customers to measure system-wide performance and availability of a cloud provider's infrastructure. When Analysis Engine 5032 produces a result, it can send the result to a Notifier 6021 that can be responsible for forwarding the result appropriately.

Six types of analyses are described below: anomaly detection analysis, cluster outlier detection analysis, resource exhaustion prediction analysis, host contention detection, bottleneck identification, and cluster utilization analysis.

One analysis that can be performed by Analysis Engine 5032 is anomaly detection. Generally, many metrics are expected to be relatively stable in value. Sharp variations often indicate problems. For these metrics, the monitoring system can analyze the data for transient changes and permanent shocks. Change detection can be more challenging in dynamic environments, which are typical of applications deployed in the cloud. Metrics in dynamic environments can change due to periodic (e.g., daily, hourly) patterns or long term changes (e.g., growth or decline).

FIG. 4B shows a flowchart depicting data analyses that can be performed by the batch analysis and reporting module, according to some embodiments. One example algorithm for detecting anomalies is depicted in the flowchart in FIG. 4B. In step 4002, the monitoring system can create a set of test signals with shapes that are to be detected. The set can include a spike up, spike down, step up, and step down. In step 4004, the analysis engine can query the Resource/Metadata Store for the set of resources in the customer environment. In step 4006, the analysis engine can query the Analysis Config Library 5031 for the metrics that should be analyzed and any special configuration or instructions that are required for analyzing that type of signal. An example of such a special configuration or instruction can be a minimum magnitude of change that the signal must exhibit before identifying an anomaly. In step 4008, the analysis engine can read data regarding how a resource or metric is behaving from Live Metric Database 401. In step 4010, the analysis engine can prepare the signal for analysis. For example, if the signal is stationary, the analysis engine can simply analyze the raw signal. If the signal is trend stationary, the analysis engine can analyze the first difference of the signal. If the signal is periodic, the analysis engine can decompose the signal into season, trend, and random components. In step 4012, the analysis engine can run change detection on the prepared signal to identify the set of potential points of interest (changepoints) for analysis. Changepoints are determined based on changes in the mean or variance of the signal. In step 4014, the analysis engine can, for each changepoint, (i) compute the Pearson correlation of the signal against each of the test signals, (ii) identify the test signal with the highest correlation to the signal, (iii) if the correlation does not meet some threshold, stop, and (iv) record the offset in the signal and the best pattern match.

For each pattern match that was detected, the analysis engine can then describe the change. For example, if the signal matches a spike pattern, the analysis engine can extract the magnitude of the spike compared to the normal parts of the signal. If the signal matches a step pattern, the analysis engine can extract the magnitude of the change (for trend stationary signals, spikes in the first difference correspond to steps in the original signal). The analysis engine can also apply any additional checks called for by the Analysis Config Library 5031 for the given resource type and metric. For example, the analysis engine can ensure that the magnitude of the change is significant relative to the variance of the signal. Or, the analysis engine can ensure that the magnitude of the spike is truly unique within the signal. Finally, in step 4016, the analysis engine can score the severity of each change, and report any pattern changes that score over a given threshold.
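The pattern-matching portion of step 4014 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: it assumes a changepoint index has already been produced by step 4012, and the window width of 9 samples, the correlation threshold of 0.8, and the names `make_test_signals` and `classify_changepoint` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def make_test_signals(width):
    """Build the four test shapes (step 4002), each centered in the window."""
    half = width // 2
    spike_up = [1.0 if i == half else 0.0 for i in range(width)]
    step_up = [0.0 if i < half else 1.0 for i in range(width)]
    return {
        "spike up": spike_up,
        "spike down": [-v for v in spike_up],
        "step up": step_up,
        "step down": [-v for v in step_up],
    }


def classify_changepoint(signal, index, width=9, min_corr=0.8):
    """Return (best pattern, correlation) for a changepoint, or None if no match.

    Assumption: `index` comes from a separate change-detection step.
    """
    start = index - width // 2
    window = signal[start:start + width]
    if start < 0 or len(window) < width:
        return None  # not enough data around the changepoint
    templates = make_test_signals(width)
    scores = {name: pearson(window, shape) for name, shape in templates.items()}
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] >= min_corr else None


# Hypothetical metric that jumps from 10 to 20 at sample 20.
step_signal = [10.0] * 20 + [20.0] * 20
print(classify_changepoint(step_signal, 20))
```

Because Pearson correlation is invariant to scale and offset, the unit-height test signals can match changes of any magnitude; the magnitude itself would be extracted separately, as described above.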

Another type of analysis that can be performed by the Analysis Engine 5032 is cluster outlier detection analysis. As discussed above, the customer's application topology can be composed of “clusters” of resources that are expected to behave in a correlated manner. Cluster outlier detection analysis correlates behavior across the cluster. If the behavior of any cluster member deviates from its peers, that member is flagged as potentially faulty. For example, a cluster of web servers all used for ad serving should generally be expected to behave in a similar manner. However, one web server may deviate from its peers (e.g., have significantly slower response times, or have a much longer backlog of requests) because it was incorrectly configured in the last software update, or because the load balancer that is responsible for distributing loads across web servers is not distributing loads evenly. Several algorithms for detecting this type of issue are presented next. The monitoring system can combine results from multiple analyses to achieve greater confidence in its combined analysis. Options for combining results can include majority or consensus.

FIG. 4C shows a flowchart depicting data analyses that can be performed by the batch analysis and reporting module, according to some embodiments. One example algorithm for performing cluster outlier detection analysis is depicted in the flowchart in FIG. 4C. In step 4102, the analysis engine can query Analysis Config Library 5031 for the metrics that should be analyzed across a cluster. In step 4104, the analysis engine can query Application Database 405 for the set of clusters in the customer environment. In step 4106, the analysis engine can query Resource/Metadata Store 402 to determine the member resources in each cluster. In step 4108, the analysis engine can read live performance-metric data for all members from Live Metric Database 401. In step 4110, the analysis engine can compute a correlation matrix for all member pairs. For example, if there are N signals corresponding to N members, the correlation matrix would be an N by N matrix where the entry on row i, column j is the correlation between signal i and signal j. In step 4112, the analysis engine can sum, for each member, the correlations against all other members, i.e., sum up the rows of the correlation matrix to compute a total for each signal. In step 4114, the analysis engine can determine if the sum of correlations for one member is an “outlier,” i.e., is significantly different from its peers. The Inter-Quartile Range (IQR) method for outlier detection can be used. The Inter-Quartile Range method allows the present system to detect outliers by determining a difference between upper and lower quartiles of a statistical distribution. In step 4116, the analysis engine can score the degree to which the member is an outlier. Finally, in step 4118, the analysis engine can report the outlier to the Notifier 6021.
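Steps 4110 through 4114 can be sketched as follows. This is a minimal illustration under stated assumptions: the fence multiplier `k=1.5` is the conventional IQR choice rather than a value given in this disclosure, the `min_iqr` floor is a hypothetical guard so that floating-point noise among nearly identical sums is not flagged, the self-correlation on the diagonal is omitted because it adds the same constant to every row, and the member names and signals are invented.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def correlation_row_sums(signals):
    """Steps 4110-4112: sum each member's correlations against all other members."""
    names = list(signals)
    return {
        a: sum(pearson(signals[a], signals[b]) for b in names if b != a)
        for a in names
    }


def iqr_outliers(sums, k=1.5, min_iqr=0.1):
    """Step 4114: flag members whose sum lies outside the IQR fences."""
    q1, _, q3 = statistics.quantiles(list(sums.values()), n=4)
    spread = max(q3 - q1, min_iqr)  # hypothetical floor, see lead-in
    low, high = q1 - k * spread, q3 + k * spread
    return [name for name, total in sums.items() if total < low or total > high]


# Hypothetical cluster: seven members follow the same load shape, one inverts it.
base = [1, 2, 3, 4, 5, 4, 3, 2]
signals = {f"web-{i}": [b * i + i for b in base] for i in range(1, 8)}
signals["web-8"] = [6 - b for b in base]

outliers = iqr_outliers(correlation_row_sums(signals))
print(outliers)
```

The IQR method needs enough members for the quartiles to be dominated by well-behaved peers; with very small clusters a single deviant member can widen the fences enough to hide itself.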

FIG. 4D shows a flowchart depicting data analyses that can be performed by the batch analysis and reporting module, according to some embodiments. Another example algorithm for performing cluster outlier detection analysis is depicted in the flowchart in FIG. 4D. In step 4202, the analysis engine can query Analysis Config Library 5031 for the metrics that should be analyzed across a cluster. In step 4204, the analysis engine can query Application Database 405 for the set of clusters in the customer environment. In step 4206, the analysis engine can query Resource/Metadata Store 402 to determine the member resources in each cluster. In step 4208, the analysis engine can read live performance-metric data for all members from Live Metric Database 401. In step 4210, the analysis engine can use Analysis of Variance (ANOVA) analysis to determine if one data set is statistically different from the others. ANOVA analysis refers to analysis that determines whether samples are drawn from statistically different populations by comparing multiple data sets. In general, the present system collapses time series data, which removes the notion of time from the data. The present system then determines whether the mean and variance of one series are statistically different from the means and variances of other series to determine whether there is an outlier. In step 4212, if the ANOVA analysis indicates there is an outlier, the analysis engine can use the Tukey HSD test to identify the outlier. The Tukey HSD test refers to a technique that also determines whether samples are drawn from statistically different populations. In contrast to ANOVA, Tukey determines pairwise differences between samples. In some embodiments, the present system uses Tukey analysis to identify a particular outlier, after using ANOVA analysis to determine the existence of an outlier. In step 4216, the analysis engine can score the degree to which the member is an outlier. Finally, in step 4218, the analysis engine can report the outlier to the Notifier 6021.
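The ANOVA portion of step 4210 can be illustrated by computing the one-way ANOVA F statistic over the collapsed samples. The sketch below covers only the F statistic; it does not compute a p-value or perform the Tukey HSD test, both of which would normally be delegated to a statistics library. The function name `anova_f_statistic` and the sample values are hypothetical.

```python
import math


def anova_f_statistic(samples):
    """One-way ANOVA F statistic for a list of samples (one list per member)."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    means = [sum(s) / len(s) for s in samples]

    # Variation of the member means around the grand mean.
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Variation of each observation around its own member's mean.
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return math.inf if ms_within == 0 else ms_between / ms_within


# Hypothetical collapsed time series for three cluster members.
samples = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [8.0, 9.0, 10.0]]
f_stat = anova_f_statistic(samples)
print(f_stat)
```

For these sample values the statistic has 2 and 6 degrees of freedom, for which the 5% critical value of the F distribution is approximately 5.14, so a statistic well above that value would suggest that at least one member differs and that a pairwise test is worth running.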

FIG. 4E shows a flowchart depicting data analyses that can be performed by the batch analysis and reporting module, according to some embodiments. Yet another example algorithm for performing cluster outlier detection analysis is depicted in the flowchart in FIG. 4E. In step 4302, the analysis engine can query Analysis Config Library 5031 for the metrics that should be analyzed across a cluster. In step 4304, the analysis engine can query Application Database 405 for the set of clusters in the customer environment. In step 4306, the analysis engine can query Resource/Metadata Store 402 to determine the member resources in each cluster. In step 4308, the analysis engine can read live performance-metric data for all members from Live Metric Database 401. In step 4310, the analysis engine can perform the following regression analysis: for each member, regress all other members as independent variables onto the member being analyzed as the dependent variable. In step 4312, the analysis engine can record the goodness-of-fit for each regression. After all regressions are complete, the analysis engine can compare the goodness-of-fit results. If all but one of the regressions show a good fit (as defined by some threshold), the analysis engine can consider the member that does not produce a good regression to be the outlier. In step 4316, the analysis engine can score the degree to which the member is an outlier. Finally, in step 4318, the analysis engine can report the outlier to the Notifier 6021.

All of the cluster outlier detection analyses described above in relation to FIGS. 4C-4E can also be modified to detect when resources are behaving similarly to one another rather than differently from each other. Detecting when resources are behaving similarly can be useful when inferring how to characterize resources into clusters based on dynamic performance of said resources (discussed previously in relation to Application and Topology Discovery 501). For example, the algorithm in FIG. 4C can be adapted such that, instead of determining if the sum of correlations for one resource is an “outlier” (i.e., significantly different from its peers), the algorithm can classify a group of resources as a cluster if the entries in the correlation matrix, or the sums of correlations for each resource in the group, are above a certain threshold. The algorithm in FIG. 4D can be adapted such that resources in a group are classified as a cluster if the ANOVA analysis indicates that there is no outlier among these resources. The algorithm in FIG. 4E can be adapted such that resources are classified into a cluster if the goodness-of-fit results for all regressions are above a certain threshold.

Another type of analysis that can be performed by the analysis engine is resource exhaustion prediction analysis. Some resources, like memory and disk space, have a hard limit. This analysis estimates when important resources will be exhausted using historical trends of resource usage.

FIG. 4F shows a flowchart depicting data analyses that can be performed by the batch analysis and reporting module, according to some embodiments. One example algorithm for performing resource exhaustion prediction analysis is depicted in the flowchart in FIG. 4F. In step 4402, the analysis engine can query Analysis Config Library 5031 for the set of resources (and their corresponding metrics) that should be analyzed. In step 4404, the analysis engine can query Application Database 405 to determine the hard limit for each metric. In step 4406, the analysis engine can query Resource/Metadata Store 402 for the set of resources used by the customer. In step 4408, the analysis engine can query Live Metric Database 401 for the data that measures the consumption of the resource. In step 4410, the analysis engine fits a line to the data using, for example, Ordinary Least Squares (OLS) linear regression. In step 4412, the analysis engine computes the goodness-of-fit of the line (i.e., the R² value). If the goodness-of-fit is below a predetermined threshold, the analysis engine can stop the analysis. In step 4414, the analysis engine can use the slope and intercept of the regression along with the current amount of the resource remaining to solve for when the line will cross the metric threshold. In step 4416, if the analysis engine estimates that the resource will be exhausted within some specified amount of time, the analysis engine can report the result to Notifier 6021.
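Steps 4410 through 4414 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the minimum R² of 0.8, the function names, and the disk-usage figures are hypothetical, and the sketch solves for the crossing time directly from the fitted line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


def time_until_exhaustion(times, usage, hard_limit, min_r2=0.8):
    """Estimate time from the last sample until usage reaches hard_limit.

    Returns None when the fit is poor (step 4412) or usage is not growing.
    """
    slope, intercept, r_squared = fit_line(times, usage)
    if r_squared < min_r2 or slope <= 0:
        return None
    crossing_time = (hard_limit - intercept) / slope  # where the line meets the limit
    return max(0.0, crossing_time - times[-1])


# Hypothetical disk usage in GB, sampled once per day, on a 100 GB volume.
days = [0, 1, 2, 3, 4]
used_gb = [50, 52, 54, 56, 58]
remaining_days = time_until_exhaustion(days, used_gb, hard_limit=100)
print(remaining_days)
```

In step 4416, the returned estimate would be compared against the reporting horizon (for example, the 3-day window mentioned earlier as semantic data) before a result is sent to the Notifier.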

Another type of analysis that can be performed by the analysis engine is host contention detection. This type of analysis looks for signs that the physical machine of the provider on which the customer's virtual resource is hosted is experiencing problems (e.g., is contended or is under high load from other tenants), for example because other virtual machines running on the same physical host are impacting the customer's application instances. Detecting when a physical machine on which a resource is being hosted is experiencing problems is at once useful and uniquely challenging in the cloud context. It is useful because if the customer knows that a virtual resource's physical host is experiencing problems, the customer can terminate that virtual resource and ask to be allocated a new virtual resource. When the infrastructure provider responds by allocating a new virtual resource, this new resource would likely be hosted on another physical host, one that probably does not suffer from whatever problem is affecting the original physical host. In some embodiments, the monitoring system can be configured to suggest just such a course of action to the customer. However, detecting problems with physical hosts can also be challenging in the cloud context because customers and the monitoring system do not have direct visibility into the operation, identity, location, or configuration of the physical hosts. Whereas in traditional infrastructure implementations a monitoring system might simply query the physical host directly to determine whether it is experiencing problems, a cloud monitoring system does not even know the identity of the physical machine that is hosting the customer's resource, much less how it is configured or where it is located. Instead, the cloud monitoring system must indirectly infer that the physical host is experiencing problems using the approaches described herein.

Three metrics can be relevant to this type of analysis: (i) a “CPU steal” metric that is related to the amount of time that a virtual machine is forced to “wait involuntarily” while the CPU is busy processing other tasks, for example for other customers, (ii) a “CPU utilization” metric that is related to the degree to which the CPU's available processing time and power is being utilized, and (iii) a “CPU idle” metric that is related to the amount of time in which an application has access to the CPU, but has no task for the CPU to perform. These metrics can be standard metrics reported by the System Data Collector 302.

FIG. 4G shows a flowchart depicting data analyses that can be performed by the batch analysis and reporting module, according to some embodiments. One example algorithm for performing host contention detection is depicted in the flowchart in FIG. 4G. In step 4502, the analysis engine can query Resource/Metadata Store 402 for all instances in the customer environment running System Data Collector 302 (because this analysis requires data reported only by System Data Collector 302, or the “agent”). In step 4504, the analysis engine can query Live Metric Database 401 for the three types of metrics discussed above: CPU steal, CPU utilization, and CPU idle. In step 4506, the analysis engine can analyze these metrics. For example, the analysis engine can compute the mean of CPU steal across time. If the CPU steal mean is greater than some threshold, the analysis engine can label the instance as having a “noisy neighbor,” meaning the CPU on which the customer's application is running is also running another application (e.g., for another customer) that is consuming a large share of the CPU's resources. As another example, the analysis engine can compute the percentage of time that the resource is busy using the CPU utilization metric. If the percentage of time a resource is busy is above some threshold, and the CPU steal mean is greater than some threshold, the analysis engine can label the resource as “throttled.” Similar thresholds and labeling techniques can be applied to the CPU idle metric.

In some embodiments, the thresholds for CPU steal, CPU utilization and CPU idle can be stored in Analysis Config Library 5031. These thresholds can be pre-programmed into the monitoring system based on experienced practitioners' judgment, or can be set or modified by a customer. These thresholds can also be programmed based on an expected performance or even based on a Service Level Agreement (SLA) established between the customer and an infrastructure provider, which guarantees a certain level of performance for the resources being provided to the customer. If the thresholds are based on expected performance or based on an SLA, the host contention analysis described herein can be used to detect violations in which the infrastructure provider fails to provide the expected level of performance or the level of performance that it had promised to provide for the customer. In other embodiments, these thresholds can be based on analysis of CPU steal, CPU utilization or CPU idle metrics for other resources that are expected to behave similarly to the resource being analyzed. Consider, for example, a scenario in which the monitoring system is monitoring five web servers that are similarly configured, perform similar roles, and are expected to perform similarly. If one web server begins to exhibit significantly different CPU steal, CPU utilization, and/or CPU idle metrics than the other four, the monitoring system can infer with a reasonable degree of confidence that the one problematic virtual web server is being hosted on a physical host that is underperforming (e.g., is contended or is under high load from other tenants). In yet other embodiments, the thresholds can be set based on historical performance of the virtual resource being analyzed. If the resource exhibits significantly lower performance than before, the monitoring system can again infer that the physical machine on which the virtual resource is hosted is experiencing problems (e.g., is contended or is under high load from other tenants).

In step 4508, the analysis engine can report the result to Notifier 6021.
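The labeling logic of step 4506 can be sketched as follows. This is a minimal illustration: all three threshold values are hypothetical placeholders for values that, as described above, could come from Analysis Config Library 5031, from an SLA, from peers, or from historical behavior, and the sketch assumes the same CPU steal threshold for both labels.

```python
def label_host_contention(cpu_steal, cpu_utilization,
                          steal_threshold=10.0,
                          busy_level=90.0,
                          busy_fraction_threshold=0.5):
    """Return contention labels for one instance.

    cpu_steal and cpu_utilization are percentage samples over the same period.
    All thresholds are hypothetical defaults, not values from the disclosure.
    """
    steal_mean = sum(cpu_steal) / len(cpu_steal)
    busy_fraction = (
        sum(1 for u in cpu_utilization if u >= busy_level) / len(cpu_utilization)
    )

    labels = []
    if steal_mean > steal_threshold:
        labels.append("noisy neighbor")
        # Busy most of the time while also losing CPU time to other tenants.
        if busy_fraction > busy_fraction_threshold:
            labels.append("throttled")
    return labels


# Hypothetical samples for two instances.
contended = label_host_contention([12.0, 15.0, 20.0, 18.0], [95.0, 97.0, 92.0, 60.0])
healthy = label_host_contention([1.0, 2.0, 1.0, 0.0], [95.0, 97.0, 92.0, 60.0])
print(contended, healthy)
```

Note that a busy instance with low CPU steal receives no label in this sketch, since high utilization alone reflects the customer's own workload rather than contention on the physical host.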

While the above discussion was centered around CPU steal, CPU idle and CPU utilization metrics, it is to be understood that other performance related metrics can also be analyzed using the techniques described above to detect host contention. For example, just as the CPU is a shared resource in a virtualized environment, the network interface and the hard disk are often shared resources. When either of those resources is over-provisioned, the amount available for use by the virtual instance may be less than what is available on other similarly sized instances for which those resources are not over-provisioned. To detect if a resource is over-provisioned, one may use a cluster outlier analysis, as described elsewhere. In a cluster, if one member is reporting much less network or disk performance, it may be due to other virtual instances on the same physical host also using some of those resources.

In addition to the algorithms described above, the responsiveness of Provider APIs 305 can be a proxy for the general health and availability of the provider infrastructure. If the Provider API 305 for a particular virtual resource returns an error code, incorrect or nonsensical metadata, an excessively delayed response, or some other problematic response, the monitoring system can conclude that there are problems with one or more of the cloud provider services, or that there is a problem with that virtual resource. An example of problems with one or more cloud provider services may be network connectivity issues within a given data center. An example of a problem with that virtual resource may be that the physical machine on which that virtual resource is hosted is overly burdened with other customers' applications, may be improperly configured, or may be experiencing some other problem not directly visible to the customer or the monitoring system. In this way, the monitoring system can use the status of API requests across all customers to gauge the health of the provider. The monitoring system can report the error rate on a per-provider, per-service, and/or per-region basis.
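The per-provider, per-service, and per-region error rate can be sketched as a simple aggregation over observed API calls. This is an illustration under stated assumptions: the record fields (`provider`, `service`, `region`, `status`, `latency_s`), the 5-second latency limit, and the rule that a status of 400 or above counts as an error are all hypothetical and do not correspond to any particular provider's API.

```python
from collections import defaultdict


def error_rates(api_calls, slow_seconds=5.0):
    """Return {(provider, service, region): fraction of problematic calls}.

    A call counts as problematic if it returned an error status or took
    longer than slow_seconds (both criteria are assumptions of this sketch).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for call in api_calls:
        key = (call["provider"], call["service"], call["region"])
        totals[key] += 1
        if call["status"] >= 400 or call["latency_s"] > slow_seconds:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


# Hypothetical call records pooled across customers.
calls = [
    {"provider": "provider-a", "service": "compute", "region": "us-east", "status": 200, "latency_s": 0.2},
    {"provider": "provider-a", "service": "compute", "region": "us-east", "status": 500, "latency_s": 0.1},
    {"provider": "provider-a", "service": "compute", "region": "us-east", "status": 200, "latency_s": 9.0},
    {"provider": "provider-a", "service": "compute", "region": "europe", "status": 200, "latency_s": 0.3},
]
rates = error_rates(calls)
print(rates)
```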

Another type of analysis that can be performed by the analysis engine is bottleneck identification. This analysis can identify resources that are busy, as measured by some utilization metric. These resources can be candidates for scaling up or out.

FIG. 4H shows a flowchart depicting data analyses that can be performed by the batch analysis and reporting module, according to some embodiments. One example algorithm for performing bottleneck identification is depicted in the flowchart in FIG. 4H. In step 4602, the analysis engine can query Analysis Config Library 5031 for the set of resource types and metrics that should be analyzed for utilization and the utilization threshold. In step 4604, the analysis engine can query Resource/Metadata Store 402 for resources in the customer environment. In step 4606, the analysis engine can query Live Metric Database 401 for data related to the required metric. In step 4608, the analysis engine can compute the percentage of time that the utilization of that metric exceeds the given threshold. In step 4610, if the percentage exceeds a specified threshold, the analysis engine can report the result to the Notifier 6021.
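Steps 4608 and 4610 can be sketched as follows. This is a minimal illustration: the utilization threshold of 80 percent, the time threshold of 50 percent, and the resource names are hypothetical, and evenly spaced samples are assumed so that the fraction of samples approximates the fraction of time.

```python
def busy_fraction(samples, utilization_threshold):
    """Fraction of samples in which utilization exceeds the threshold (step 4608)."""
    return sum(1 for s in samples if s > utilization_threshold) / len(samples)


def find_bottlenecks(metrics, utilization_threshold=80.0, time_threshold=0.5):
    """Return resources that are busy for more than time_threshold of the period."""
    return sorted(
        name
        for name, samples in metrics.items()
        if samples and busy_fraction(samples, utilization_threshold) > time_threshold
    )


# Hypothetical utilization samples (percent) per resource.
metrics = {
    "db-1": [85.0, 90.0, 95.0, 70.0],
    "web-1": [20.0, 30.0, 85.0, 40.0],
}
bottlenecks = find_bottlenecks(metrics)
print(bottlenecks)
```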

Yet another type of analysis that can be performed by the analysis engine is cluster utilization analysis. This type of analysis computes the aggregate resource utilizations across a cluster.

FIG. 4I shows a flowchart depicting data analyses that can be performed by the batch analysis and reporting module, according to some embodiments. One example algorithm for performing cluster utilization analysis is depicted in the flowchart in FIG. 4I. In step 4702, the analysis engine can query Application Database 405 for topology information for clusters. In step 4704, the analysis engine can query Resource/Metadata Store 402 for members of those clusters. In step 4706, the analysis engine can query Live Metric Database 401 for a metric across all members. In step 4708, the analysis engine can compute the aggregate utilization across all members in those clusters for the customer. In step 4710, the analysis engine can send the result to Notifier 6021.
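
The aggregation in step 4708 can be illustrated with a short sketch. The disclosure does not specify the aggregation function; the arithmetic mean used here, along with the function name and the input dictionaries, is an assumption made for illustration only.

```python
from statistics import mean
from typing import Dict, List, Sequence


def cluster_utilization(cluster_members: Dict[str, List[str]],
                        metric_by_resource: Dict[str, Sequence[float]]
                        ) -> Dict[str, float]:
    """Aggregate one metric across all members of each cluster.

    The mean is an assumed aggregation; a sum or percentile
    could be substituted without changing the structure.
    """
    result = {}
    for cluster, members in cluster_members.items():
        # Pool every sample from every member of the cluster.
        values = [value
                  for member in members
                  for value in metric_by_resource.get(member, [])]
        if values:  # skip clusters with no data
            result[cluster] = mean(values)
    return result
```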

Similar to how the analysis engine computes resource utilization across each customer's infrastructure, the analysis engine can also combine results across customers. The cross-customer results serve as a benchmark by which customers can evaluate and compare their usage.

All of the algorithms described above can be tuned by feedback provided by the customer, either explicitly or implicitly. Explicit feedback is information provided by the customer knowingly through feedback mechanisms defined in the monitoring system. Implicit feedback is information deduced by the monitoring system by observing customer interaction (or absence of interaction). Feedback (both explicit and implicit) might include information such as the following:

-   A particular analysis result is not of interest.
-   An analysis result is not of interest because the magnitude is too small.
-   An analysis result is not of interest because of the resource or group analyzed.
-   An analysis result is not of interest because of the type of analysis.
-   An analysis result is not of interest because of the metric analyzed.
-   An analysis type should never be performed.
-   An analysis type should never be performed for a resource or group.
-   An analysis on a specified metric should never be performed.
-   An analysis on a specified metric should never be performed for a resource or group.

Through feedback, the monitoring system can tailor the analyses that it performs and the results that it communicates to the user.

All of the above analytics, i.e., anomaly detection analysis, cluster outlier detection analysis, resource exhaustion prediction analysis, host contention detection, bottleneck identification, and cluster utilization analysis, can be enhanced using the inferred hierarchical relationships in the service architecture deduced by Application and Topology Discovery 501. Anomalies, outliers, resource exhaustion issues, host contention issues, bottleneck issues, cluster utilization issues, or other potential problems worthy of notification discovered within a group by the above-described analytics can be percolated up to parent groups within which the group is nested, and then further up to grandparent groups, etc. This escalation of issues up the chain to parent groups can be useful for conducting root cause analysis. For instance, the monitoring system can tell a customer that a problem has been discovered within a certain parent group (e.g., the “Cassandra” group). The customer can then drill down to see which sub-group within this group is causing the problem (e.g., “Cassandra cluster 34”), and then drill down further to sub-sub-groups (e.g., “Cassandra cluster 34, instance A”). In this way, the customer can quickly identify the root cause of the problem discovered within the parent group.
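
The percolation of an issue up the group hierarchy can be sketched as a walk along parent links. The `parent_of` mapping and the function name are hypothetical; the disclosure does not state how the hierarchy is stored.

```python
from typing import Dict, List


def groups_to_flag(group: str, parent_of: Dict[str, str]) -> List[str]:
    """Return the group in which an issue was found, followed by its
    parent, grandparent, and so on up to the top-level group."""
    chain = [group]
    seen = {group}
    while group in parent_of:
        group = parent_of[group]
        if group in seen:  # guard against a malformed, cyclic hierarchy
            break
        chain.append(group)
        seen.add(group)
    return chain
```

Drilling down, as described above, is the same chain read in reverse: from “Cassandra” to “Cassandra cluster 34” to “Cassandra cluster 34, instance A”.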

FIG. 5 is a block diagram showing the operation of the Event Detection module, the Exception Monitoring module, and the Policy Analyzer module, according to some embodiments. Specifically, FIG. 5 shows the details of the Event Detection 502 and Exception Monitoring 504 modules. The Event Detection module 502 can be configured to sense when a resource has been added or removed, and/or when there have been infrastructure-related changes in the customer environment. The Exception Monitoring module 504 can be configured to sense when monitored metrics in the customer environment behave in unexpected ways, which may be indicative of an anomaly or problem in the customer's application.

Event Detection module 502 can perform different types of analysis related to detection of infrastructure changes in the customer environment. Three example types of analysis are described below: Log Scanning 5024, Infrastructure Change Detection 5021, and Resource Change Detection 5022.

Log Scanning analysis 5024 takes as input log data from Event Store 406 (which Event Store 406 receives from Log Data Collector 306 via Data Gateway 701), and scans the log data to find patterns or events that should be reported to other parts of the system.

Infrastructure Change Detection analysis 5021 can detect changes in a customer's infrastructure using as input infrastructure metadata received from Resource/Metadata Store 402. Resource/Metadata Store 402, in turn, can receive this metadata from Infrastructure Platform Collector 301, which can receive it from Provider APIs 305. This infrastructure metadata can be global-scale metadata that describes the customer infrastructure, and can include the set of virtual resources in the environment, their location, and the security rules defined in the infrastructure. To detect changes in a customer's infrastructure (for example, when resources are added or removed), Infrastructure Change Detection analysis 5021 can keep an inventory of a customer's infrastructure in a database, and periodically update this inventory by querying Resource/Metadata Store 402 (which in turn receives information from Provider APIs 305). Event Detection module 502 can compare the results of each query with the state stored in the database.
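
The comparison between a fresh query result and the stored inventory amounts to a set difference. The sketch below assumes, for illustration only, that each resource is identified by a string ID; the function name and return shape are hypothetical.

```python
from typing import Dict, List, Set


def diff_inventory(stored: Set[str], queried: Set[str]) -> Dict[str, List[str]]:
    """Compare the stored inventory with the latest query result and
    report which resource IDs were added and which were removed."""
    return {
        "added": sorted(queried - stored),
        "removed": sorted(stored - queried),
    }
```

After the differences are recorded as events, the stored inventory would be replaced with the queried set so that the next comparison starts from the current state.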

As discussed above, the customer's infrastructure (i.e., the set of resources allocated to the customer by the infrastructure provider) can change dynamically as the customer grows and shrinks their environment to track short-term load changes (due to time of day or day of week) or long-term load changes (due to business growth or contraction), or replaces instances to deploy new software versions. Infrastructure Change Detection analysis 5021 can therefore query Resource/Metadata Store 402 (which in turn can query Provider APIs 305 via Infrastructure Platform Collector 301) frequently enough to capture an expected rate of change of the customer's infrastructure. In one embodiment, analysis 5021 can update its inventory of the customer's infrastructure on a frequent but constant basis (e.g., every 5 minutes). In another embodiment, analysis 5021 can vary the frequency of its queries according to different times of day or seasons, or according to external events that are expected to cause large changes in the customer's infrastructure.

When new resources are detected in the customer's infrastructure, or when old resources are removed, Infrastructure Change Detection analysis 5021 can record that fact in Event Store 406 in two different ways: it can send a message to Event Store 406 via Data Gateway 701 (not shown in FIG. 5), or it can send a message to Event Store 406 via Policy Analyzer 408 (discussed in more detail below).

Resource Change Detection analysis 5022 can detect changes in resources used by a customer's application. This analysis uses as input resource metadata, which can also come from Resource/Metadata Store 402. Resource/Metadata Store 402, in turn, can receive this resource metadata from Infrastructure Platform Collector 301, which can receive it from Provider APIs 305. This resource metadata can include, for example, instance type, tags, and running services for virtual instances, and port maps and backing instances for hosted load balancers. Similar to Infrastructure Change Detection analysis 5021, Resource Change Detection analysis 5022 can store the state of monitored resources in a database. When Resource Change Detection analysis 5022 receives new metadata for a resource, the analysis can compare the received metadata to the last known state of that resource. If the state of that resource has changed, an event is sent to Event Store 406 via Data Gateway 701 (not shown). Also similar to Infrastructure Change Detection analysis 5021, Resource Change Detection analysis 5022 can update its state for resources frequently enough to capture an expected rate of change of the customer's infrastructure.
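
A minimal sketch of the per-resource comparison follows. It assumes, hypothetically, that resource metadata is held as a flat dictionary of field names to values; the field names in the usage example are illustrative only.

```python
from typing import Any, Dict, Tuple


def changed_fields(last_known: Dict[str, Any],
                   received: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (old_value, new_value)} for every metadata field
    that differs between the last known state and the received state.
    A field missing on one side is reported with None on that side."""
    fields = last_known.keys() | received.keys()
    return {
        field: (last_known.get(field), received.get(field))
        for field in fields
        if last_known.get(field) != received.get(field)
    }
```

An empty result means the resource is unchanged and no event needs to be sent; a non-empty result could form the body of the event sent to Event Store 406.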

Turning now to Exception Monitoring module 504, this module can also perform multiple types of analysis related to detecting exceptions in data streams of monitored metrics. Two example types of analysis are described below: Data Condition Detection analysis 5023, and Intelligent Change Detection analysis 5025.

Data Condition Detection analysis 5023 accepts as input a stream of metric data directly from Data Gateway 701. This stream of metric data can be a time series of measurements for metrics, coming from Provider APIs 305, System Data Collector 302, or Application Data Collector 303, and can include output from various applications and services. This stream of data is similar to the kind of data that is sent to and stored in Live Metric Database 401; however, for reliability reasons, Exception Monitoring module 504 can be configured to receive this data directly from Data Gateway 701, in case Live Metric Database 401 is disabled or otherwise unavailable.

Data Condition Detection analysis 5023 can evaluate this stream of metric data for conditions. The conditions which it evaluates are based on policies stored in Policy Database 404. For example, Data Condition Detection analysis 5023 can evaluate whether a given metric is above or below a given minimum or maximum threshold. Alternatively, Data Condition Detection analysis 5023 can evaluate whether a given metric exceeds thresholds for a preconfigured period of time.

Sometimes, however, Exception Monitoring module 504 can receive metric data out of order. Or, a policy which Data Condition Detection analysis 5023 is evaluating can require 30 minutes or more of data in order to compute the result of the condition. To overcome these problems, Exception Monitoring module 504 can store in local memory enough data for the given metric to cover the “duration” of the condition being evaluated (discussed below). Each time a measurement is streamed to the detector, the detector can merge the measurement with the existing set of measurements for the condition. For the “all” and “average” types of conditions (discussed below), the detector can update the value of the condition. If the value of the condition changes, the detector can emit an event to the Policy Analyzer 408.

The detector can also be configured to compensate for “flapping.” Flapping is when the value of a condition changes incorrectly due to out-of-order messages. To avoid flapping, the detector can be configured to wait until a period of time has passed before emitting a message to the Policy Analyzer 408. The period of time can be based on the message delay that the detector is currently observing, which the detector can determine by analyzing the timestamps in the messages streaming through it. This probabilistic approach can help prevent flapping. The chance of flapping can be further reduced by adding a buffer to the delay.
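
One way to derive the waiting period is sketched below. The disclosure says only that the period is based on the observed message delay plus a buffer; taking the maximum recent delay, the default buffer value, and the function name are assumptions for illustration.

```python
from typing import Sequence


def hold_off_seconds(measured_at: Sequence[float],
                     received_at: Sequence[float],
                     buffer_seconds: float = 5.0) -> float:
    """Estimate how long to wait before emitting a condition change.

    measured_at: timestamps carried inside recent messages.
    received_at: times at which the detector received those messages.
    The largest observed delay is used (an assumption), and a fixed
    buffer is added to further reduce the chance of flapping.
    """
    delays = [max(0.0, received - measured)
              for measured, received in zip(measured_at, received_at)]
    observed_delay = max(delays, default=0.0)
    return observed_delay + buffer_seconds
```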

Intelligent Change Detection analysis 5025 is similar to but more sophisticated than Data Condition Detection analysis 5023. In particular, this type of analysis can take as input not only metric data from Data Gateway 701, but also past metric data from Live Metric Database 401. This added source of data allows Intelligent Change Detection analysis 5025 to perform more complex analyses. If, however, Live Metric Database 401 is disabled or otherwise unavailable, Exception Monitoring module 504 will not be able to perform Intelligent Change Detection analysis 5025.

Intelligent Change Detection analysis 5025 can search for deviations of behavior that have not been specifically configured by the user and stored in the Policy Database 404. The goal of this type of analysis is similar to that of the Anomaly Detection type analysis performed in the Analysis Engine 5032, i.e., this analysis uses past behavior (drawn from Live Metric Database 401) to create a model for behavior of a resource in the near future. As metric data arrives from Data Gateway 701, the detector compares that data against the model constructed from past data from Live Metric Database 401. If the data does not fit the model, the condition is raised. Because this analysis operates on current data coming from Data Gateway 701, the algorithms used for detecting changes differ from those used by the Anomaly Detection type analysis used by the Analysis Engine 5032 in the Batch Analysis Subsystem 503. One example algorithm is given below.

FIG. 5B is a flowchart depicting one example algorithm that can be used by Intelligent Change Detection analysis in the Exception Monitoring module, according to some embodiments. Specifically, FIG. 5B depicts one potential way for Exception Monitoring Module 504 to perform Intelligent Change Detection analysis 5025. In step 5102, Intelligent Change Detection analysis 5025 fetches a time series of data comprising N data periods from Live Metric Database 401 and/or Data Gateway 701, i.e., data[1], data[2] . . . data[N]. In step 5104, Intelligent Change Detection analysis 5025 sets an index variable i to 1, and at the same time initializes every entry of a Boolean array of size N called “Anomaly” to FALSE. In step 5106, Intelligent Change Detection analysis 5025 generates an Auto-Regressive Integrated Moving Average (ARIMA) model using user-specified confidence intervals based on all data periods up to and including period i, i.e., based on data[1] . . . data[i]. In step 5108, analysis 5025 makes a prediction for what the values for data[i+1] and data[i+2] should be based on the generated ARIMA model. This prediction can take the form of expected upper and lower bounds for each of data[i+1] and data[i+2] rather than discrete values. In step 5110, analysis 5025 compares the prediction for data[i+1] against the actual data[i+1]. If data[i+1] does not match the predicted data[i+1], or if data[i+1] falls outside the expected upper and lower bounds for data[i+1], then the corresponding entry in the Boolean array Anomaly (i.e., Anomaly[i+1]) is set to TRUE in step 5114. Similarly, in step 5112, analysis 5025 compares the prediction for data[i+2] against the actual data[i+2]. If data[i+2] does not match the predicted data[i+2], or if data[i+2] falls outside the expected upper and lower bounds for data[i+2], then the corresponding entry in the Boolean array Anomaly (i.e., Anomaly[i+2]) is set to TRUE in step 5116. In step 5118, analysis 5025 checks to see if it has iterated through all N periods, i.e., if i=N. If not, analysis 5025 can increment i by 1 in step 5120. Otherwise, analysis 5025 ends in step 5122. At the end of all the iterations, the resulting array Anomaly[1 . . . N] should have values set to TRUE where anomalies were detected in the metric data stream.
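
The walk-forward structure of FIG. 5B can be sketched as follows. Note that this sketch does not fit an ARIMA model: as a stand-in, it predicts bounds from the mean and population standard deviation of the history, which is a much simpler technique chosen only to keep the example self-contained. The function names, the `min_history` and `width` parameters, and the use of 0-based indices are all assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def predict_bounds(history: Sequence[float], width: float) -> Tuple[float, float]:
    """Stand-in for the ARIMA forecast of steps 5106 and 5108:
    bounds are mean +/- width * standard deviation of the history."""
    center = mean(history)
    spread = pstdev(history)
    return center - width * spread, center + width * spread


def flag_anomalies(data: Sequence[float],
                   min_history: int = 3,
                   width: float = 3.0) -> List[bool]:
    """Walk forward through the series. At each position i, build the
    model from data[:i] and test the next two points against it."""
    anomaly = [False] * len(data)            # step 5104
    for i in range(min_history, len(data)):  # steps 5118 and 5120
        low, high = predict_bounds(data[:i], width)
        for step in (0, 1):                  # one and two periods ahead
            j = i + step
            if j < len(data) and not (low <= data[j] <= high):
                anomaly[j] = True            # steps 5114 and 5116
    return anomaly
```

Because the same bounds are applied to both look-ahead points, the stand-in is cruder than a real forecast, whose interval would typically widen with the forecast horizon. The bounds check on `j` also avoids reading past the end of the series on the final iterations.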

Two other data sources provide configuration information. Application Database 405 can store the result of Application and Topology Discovery 501. As discussed previously, Application and Topology Discovery 501 can analyze data and metadata from the customer's application to establish relationships between components of the customer's environment. The topology information can be used in the definition of some policies.

Policy Database 404 can store global, best-practice, and/or user-defined policies. A policy is a Boolean expression of conditions that are of interest to the customer.

A condition is a single comparison or test against data or metadata; in other words, conditions describe a change that is of interest to the user. Conditions for metadata may include: (i) a new resource was found, (ii) the security group for an instance was changed, and (iii) a tag was added to an instance.

Conditions on metric data can be more expressive. A metric-based condition is defined by a measurement threshold, a duration, and a method. The method determines how to evaluate the series of measurements to determine the condition. There are three methods:

-   any: The condition is evaluated to true if any measurement exceeds the “threshold”. For this method, the “duration” is ignored.
-   all: The condition is evaluated to true if all measurements in the period specified by “duration” exceed the “threshold”.
-   average: The condition is evaluated to true if the average of measurements in the period specified by “duration” exceeds the “threshold”.
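
The three methods can be sketched as a single evaluation function. The representation of measurements as (timestamp, value) pairs, the `now` parameter, and the choice to return false when the window contains no measurements are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence, Tuple


def evaluate_condition(measurements: Sequence[Tuple[float, float]],
                       threshold: float,
                       duration: float,
                       method: str,
                       now: float) -> bool:
    """Evaluate a metric-based condition.

    measurements: (timestamp, value) pairs, in any order.
    duration: length of the evaluation window ending at `now`.
    """
    if method == "any":
        # The duration is ignored for this method.
        return any(value > threshold for _, value in measurements)

    window = [value for timestamp, value in measurements
              if now - duration <= timestamp <= now]
    if not window:
        return False  # assumed behavior when no data is available
    if method == "all":
        return all(value > threshold for value in window)
    if method == "average":
        return mean(window) > threshold
    raise ValueError(f"unknown method: {method!r}")
```

Because the window is selected by timestamp rather than by arrival order, out-of-order measurements are handled naturally once they have been merged into the stored set.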

The detectors (5021, 5022, 5023, 5024, and 5025) can then forward messages to the Policy Analyzer 408 describing changes in metadata and conditions. The Policy Analyzer can identify when the combination of conditions for a policy is met. It can read the set of policies from the Policy Database 404. For policies that refer to topological abstractions, it can also read information from the Application Database 405. The Policy Analyzer 408 can compare the state of the customer application, as defined by the series of events emitted from the detectors that reflect conditions in the infrastructure, against the policies. When the combination of conditions defined by a policy is satisfied, it can store in Event Store 406 a record indicating that a policy was satisfied, i.e., that an incident has occurred.
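
Since a policy is a Boolean expression of conditions, its evaluation can be sketched as a small recursive function. The nested-tuple representation, the operator names, and the condition names in the usage example are hypothetical; the disclosure does not define a policy syntax.

```python
from typing import Mapping, Union

# A policy is either a condition name or a tuple
# ("and" | "or" | "not", sub-policy, ...). This encoding is an assumption.
Policy = Union[str, tuple]


def policy_satisfied(policy: Policy,
                     condition_state: Mapping[str, bool]) -> bool:
    """Evaluate a policy against the current state of each condition,
    as reported by the detectors. Unknown conditions count as false."""
    if isinstance(policy, str):
        return bool(condition_state.get(policy, False))
    operator, *operands = policy
    if operator == "and":
        return all(policy_satisfied(p, condition_state) for p in operands)
    if operator == "or":
        return any(policy_satisfied(p, condition_state) for p in operands)
    if operator == "not":
        return not policy_satisfied(operands[0], condition_state)
    raise ValueError(f"unknown operator: {operator!r}")
```

For example, the policy `("and", "metrics_missing", ("not", "instance_terminated"))` would be satisfied only when metrics stop arriving from an instance that has not been terminated.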

Policy Analyzer 408 is positioned to integrate and analyze the state of different condition types for different detector types. In other words, the Policy Analyzer can integrate data from each of Log Scanning 5024, Infrastructure Change Detection 5021, Resource Change Detection 5022, Intelligent Change Detection 5025 and Data Condition Detection 5023. These different types of detectors, in turn, receive data from each of Event Store 406, Resource/Metadata Store 402, Live Metric Database 401, and Data Gateway 701. The data processed by these detectors can comprise infrastructure metadata provided by Provider APIs 305, system-level data from System Data Collector 302, application-level data from Application Data Collector 303, log data from Log Data Collector 306 and HTTP endpoint data from Endpoint Monitoring Probes 307.

Policy Analyzer 408 can also send a message representing the incident to Notifier 6021. The notifier can determine whether and how notification of the incident should be made to the customer. If the customer has specified that the incident should be handled with an automatic response, the notifier can also send a message to the Automation 603 subsystem.

Policy Analyzer 408's ability to integrate and analyze as a whole data from all these diverse sources and data types can be extremely powerful. For example, if the System Data Collector stops sending data from a certain resource X, Exception Monitoring module 504 (in particular, Intelligent Change Detection analysis 5025 or Data Condition Detection 5023) might detect that condition and communicate to Policy Analyzer 408 that the condition has been detected, indicating that there might be a problem with resource X. Before sending an alert to Notifier 6021, however, Policy Analyzer 408 can first check, with the help of Event Detection module 502, the actual state of resource X, as indicated by the infrastructure metadata for resource X. For example, Policy Analyzer 408 can check whether resource X has been terminated or stopped. If resource X has been terminated or stopped, Policy Analyzer 408 can conclude that the absence of metrics from resource X is due to the termination of the resource rather than the failure of the software running on the instance. Since, as discussed above, the termination of a resource can be a common occurrence given the elastic nature of the cloud, Policy Analyzer 408 can be configured to send only a low priority notification, or no notification at all, to Notifier 6021. Alternatively, Policy Analyzer 408 can also check whether resource X was stopped because it was released by the customer (in which case termination/stoppage of resource X would not be a noteworthy event), or whether resource X was terminated because it was taken away by the infrastructure provider (in which case termination/stoppage of resource X would be a noteworthy event). In yet another alternative, if system-level metric data from System Data Collectors do not cut off completely, but indicate changed conditions at a resource (e.g., the resource is running at higher or lower capacity than before, or is more or less responsive than before), Policy Analyzer 408 can check the infrastructure metadata to determine the type of resource (e.g., AWS t1.micro or AWS m1.small) or the role of the resource (e.g., web server vs. load balancer) so that it can apply the appropriate thresholds for determining whether the changed condition is worthy of notification. For example, the threshold for determining that an increase in web server load is worthy of notification to the customer can be different depending on whether the resource is an AWS t1.micro or an AWS m1.small. To emphasize, the Policy Analyzer is conditionally interpreting the detection of a condition (the absence of or change in metrics) based on the broader context of the state of the infrastructure.

Further, the Policy Analyzer could additionally consider the anticipated state of resource X. For example, the contents of the Event Store could contain audit logs (perhaps collected from the Log Scanning detector) indicating whether a resource was scheduled or requested to be terminated by the customer. If a request was made to terminate a resource by the customer to the infrastructure provider, then the resource's absence from the inventory at the scheduled time should not be considered worthy of notification. If, however, the instance is found to be terminated without request from the user, then the situation again becomes cause for notification.

By combining and analyzing together data from different data sources, Policy Analyzer 408 is able to discriminate between expected and unexpected changes in data conditions, and tailor its notification scheme accordingly. This example also illustrates why it is important for Infrastructure Change Detection analysis 5021 and Resource Change Detection analysis 5022 to keep their inventory of the customer's infrastructure up-to-date.
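
The contextual interpretation of a “metrics stopped arriving” condition can be sketched as a small decision function. The state strings, the returned labels, and the function name are hypothetical, and the sketch implements only the variant in which customer-requested terminations are suppressed.

```python
def missing_metrics_action(resource_state: str,
                           customer_requested_termination: bool) -> str:
    """Decide how to treat the absence of metrics from a resource.

    resource_state: state from infrastructure metadata, assumed here to
        be one of "running", "stopped", or "terminated".
    customer_requested_termination: whether audit logs show that the
        customer asked for the resource to be terminated or stopped.
    """
    if resource_state not in ("stopped", "terminated"):
        # The resource should be reporting; software on it may have failed.
        return "notify"
    if customer_requested_termination:
        # Expected consequence of an elastic environment shrinking.
        return "suppress"
    # The resource went away without a customer request.
    return "notify"
```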

Another powerful way in which Policy Analyzer 408 can combine data from different sources to achieve more robust analysis is Policy Analyzer 408's ability to analyze data streams according to the service architecture stored in Application Database 405, as determined by Application and Topology Discovery 501. For example, a customer may want to be alerted only when two conditions are fulfilled: (i) a resource X of cluster Y is experiencing higher than usual load, and (ii) that resource X is experiencing higher load than its peer members in the same cluster Y. In other words, a customer may not care if the entire cluster is running hot, but be concerned if one member is experiencing heavier than usual load while the others are not. While Exception Monitoring module 504 can detect when condition (i) is fulfilled, i.e., that resource X is experiencing higher than usual load, Policy Analyzer 408 can also check for condition (ii) by consulting Application Database 405 to determine the list of resources that are also in cluster Y, and by checking the load conditions of those resources as well.
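
The two-condition check can be sketched as follows. How much higher than its peers a member must be is not specified in the disclosure; the `margin` multiplier, the comparison against the peer mean, and the function name are assumptions.

```python
from statistics import mean
from typing import Mapping


def is_hot_outlier(resource: str,
                   cluster_loads: Mapping[str, float],
                   usual_load: float,
                   margin: float = 1.5) -> bool:
    """True only if the resource is (i) above its own usual load and
    (ii) well above the average load of its peers in the same cluster."""
    load = cluster_loads[resource]
    peers = [value for name, value in cluster_loads.items()
             if name != resource]
    above_usual = load > usual_load
    above_peers = bool(peers) and load > margin * mean(peers)
    return above_usual and above_peers
```

With this check, a cluster whose members all run hot produces no alert, while a single hot member among lightly loaded peers does.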

FIG. 6 is a block diagram showing the operation of the User Interface, according to some embodiments. Specifically, FIG. 6 depicts the flow of data to and from the user interface with which the customer/user interacts. User Interface 601 can provide the user with a mechanism through which to view resource and metric information. It also provides the user with a mechanism through which to view and update user settings and policy information. For example, users can add charts on dashboards. Users can also specify the metrics and resources that should be displayed.

When the user attempts to view pages that include metric data, User Interface 601 retrieves the appropriate metrics from Live Metric Database 401. Likewise, when resource information or metadata is requested, User Interface 601 retrieves resource inventory and metadata from the Resource/Metadata Store 402. When the user requests event information, the interface retrieves the data from the Event Store 406. When the user views, creates, edits, or deletes policy information, the interface contacts the Policy Database 404. When the user updates user information or application settings (like dashboard configuration or notification configuration), the interface uses the Application Database 405.

User Interface 601 can be configured to filter displayed pages so only those resources that satisfy filter criteria are displayed. For example, when a group filter is applied, the user interface can filter the view (e.g., resource lists and graphs) to include only those resources that satisfy the group criteria. When filter criteria are manually entered, the application can filter the page contents in real-time.

User Interface 601 can also be configured to allow the user to visualize metrics via a graphing module. The graphing module comprises the following settings:

-   Metrics: Specifies the metrics to be displayed in the chart.
-   Filter: Specifies the resources to be included in the chart.
-   Time Aggregation: Specifies that metrics should be aggregated across time. The consumer of the chart describes how much the data should be rolled up and the function to apply to the metrics during aggregation (all metrics, sum of metrics, average of metrics, median of metrics, 95th percentile value, 99th percentile value, etc.).
-   Resource Aggregation: Specifies that metrics should be aggregated across resources. This is commonly used to aggregate metrics across members of a group or cluster.
-   Timeframe: Specifies the start and end date and time for the metrics to be displayed on the charts in a page.

As discussed above, Analysis Config Library 5031 can store analysis configurations for various combinations of resource type, service type, and/or system/service metric. These analysis configurations are based on experienced practitioners' input. Part of these analysis configurations can include instructions for what metrics to show in the graphing module, what filters and time aggregation to apply, and what timeframe to display. In this way, the monitoring system can automatically display the most relevant metrics in the most intuitive display format immediately upon being connected to a new customer's application, without any input from the customer. While the customer can always customize what is displayed in User Interface 601 according to its needs, the monitoring system's ability to automatically “guess” at what the customer would like to view, and how the customer would like to view it, can significantly reduce setup time. For example, if a new customer defines a group to represent its web application, the customer could be automatically presented with a dashboard with key metrics related to the customer's load balancer, web server cluster, Apache web service, and other relevant service and infrastructure metrics. These metrics could also be displayed in the most intuitive format, with the appropriate time and resource filters applied.

The graphing module can also support the following user interactions:

-   Highlight time series (hover): Automatically shades any time series that the mouse pointer is not currently hovering over.
-   Highlight time series (lock): Shades any time series for which the user has not toggled highlighting on.
-   Zoom: Enables the user to narrow the view to a particular timeframe.
-   Event details (hover): Shows details of events that occurred during the timeframe shown on the chart.
-   Legend toggle: Legends can be toggled on or off.
-   Value details (hover): By hovering over a data point on one of the curves, the chart will show the exact values of members.

There are many conditions under which it may not be necessary or appropriate to display all measurements for a particular time series. In these cases, the user interface will automatically request and display the aggregated data with an appropriate granularity from the appropriate databases.

Examples might include:

-   When viewing several weeks of data, it may be appropriate to show 60-minute averages.
-   When viewing several days of data, it may be appropriate to show 15-minute averages.
-   When viewing several hours of data, it may be appropriate to show 5-minute averages.

For certain user interactions, such as zooming, or changing the timeframe for a page, the system may select a new aggregation for the graphs.
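
The selection of an aggregation granularity from the viewed timeframe can be sketched as a simple lookup. The roll-up intervals come from the examples above, but the cut-off points of one week, one day, and one hour are assumptions, since the text says only “several” weeks, days, or hours.

```python
from datetime import timedelta
from typing import Optional


def rollup_for(timeframe: timedelta) -> Optional[timedelta]:
    """Pick the averaging interval for a chart spanning `timeframe`.
    Returns None when raw measurements can be shown without roll-up."""
    if timeframe >= timedelta(weeks=1):   # assumed cut-off for "weeks"
        return timedelta(minutes=60)
    if timeframe >= timedelta(days=1):    # assumed cut-off for "days"
        return timedelta(minutes=15)
    if timeframe >= timedelta(hours=1):   # assumed cut-off for "hours"
        return timedelta(minutes=5)
    return None
```

When a user zooms or changes the timeframe, calling the function again with the new span yields the new aggregation to request.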

FIG. 7 is a block diagram showing the operation of the Notification Gateway, according to some embodiments. Specifically, FIG. 7 shows the notification system for the monitoring system. The Application & Topology Discovery subsystem 501, Event Detection subsystem 502, Batch Analysis & Reporting subsystem 503 and Exception Monitoring subsystem 504 can all send information to the Notification Gateway 6021. Notification Gateway 6021 can then send notifications to a user or groups of users via a number of methods including email 6022, SMS 6023, and Third Party Notification Systems 6024.

Email Notification System 6022 can accept settings and content from other systems, and generate and send email messages to users via a hosted third-party email delivery service.

SMS Notification System 6023 can accept settings and content from other systems, and generate and send SMS messages to users via a hosted third-party SMS delivery service.

Third Party Notification Systems 6024 can be any other notification system which manages alerts and notifications to end users. Examples of such third party notification systems include notification services such as those provided by PagerDuty, or Webhook technology. Third party notification systems can be configured according to settings stored in Application Database 405.

Notification Gateway 6021 can also control the frequency of notifications. For example, it can notify the user only after an issue has been open for a specified length of time. It can also filter out notifications that are substantially similar to other recent notifications to limit the amount of information being pushed to users.
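
Both frequency controls can be sketched in one small class. The class name, its parameters, and the use of an exact-match issue key as a stand-in for “similar” notifications are assumptions; a real similarity test would likely be more involved.

```python
from typing import Dict


class NotificationThrottle:
    """Delay notifications for young issues and drop near-duplicates."""

    def __init__(self, min_open_seconds: float, dedupe_window_seconds: float):
        self.min_open_seconds = min_open_seconds
        self.dedupe_window_seconds = dedupe_window_seconds
        self._last_sent: Dict[str, float] = {}  # issue key -> last send time

    def should_notify(self, issue_key: str, opened_at: float, now: float) -> bool:
        # Rule 1: the issue must have been open long enough.
        if now - opened_at < self.min_open_seconds:
            return False
        # Rule 2: suppress if a notification with the same key went out recently.
        last = self._last_sent.get(issue_key)
        if last is not None and now - last < self.dedupe_window_seconds:
            return False
        self._last_sent[issue_key] = now
        return True
```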

FIG. 8 is a block diagram showing how Infrastructure Platform Collector can be scaled to collect data from large numbers of Provider APIs, according to some embodiments. Specifically, FIG. 8 shows how data can be collected from customer applications on a large scale. In particular, FIG. 8 shows how Infrastructure Platform Collector 3015 can be configured not as a single hardware/software module, but as a multitude of Cloud Collectors 3014, each of which can be implemented in a separate hardware/software instance. Infrastructure Platform Collector Scheduler 3011 can be responsible for coordinating the efforts of Cloud Collectors 3014 by scheduling collection tasks. To schedule tasks, Infrastructure Platform Collector Scheduler 3011 can track which cloud providers are used by which customers in the Customer Cloud Platforms Database 3013 and the set of APIs in each cloud platform in the Cloud Platform API Manifest 3012. Based on these inputs, the scheduler queues collection tasks to be run and provides the Cloud Collectors 3014 with all of the information they need to talk to the infrastructure provider. Cloud Collectors 3014 can de-queue tasks and communicate with Provider APIs 305A, 305B, etc. to collect the data they were assigned to collect. After collecting data, the Cloud Collectors 3014 in Infrastructure Platform Collector 3015 can push messages to Data Gateway 701.

Some cloud infrastructure providers use rate limiting (i.e., “throttling”) on Provider APIs 305A, 305B, etc. This is a particular challenge for data collection at scale. Cloud Collectors 3014 use the Provider APIs on behalf of the customer, and rate limiting is imposed on the customer, not the monitoring system. The Infrastructure Platform Collector Scheduler 3011 can schedule collection tasks per customer to avoid rate limiting. For example, tasks can be scheduled on a periodic basis based on the desired fidelity of data. If, however, responses from the Provider APIs 305A, 305B, etc. indicate that the queries on behalf of a customer are being throttled, the Infrastructure Platform Collector Scheduler 3011 can reduce the frequency of data collection for that specific customer. If a customer has more resources than can be queried at the desired frequency within the constraints of the rate limiting, the Infrastructure Platform Collector Scheduler 3011 must reduce the frequency of collection across all the customer's resources. Infrastructure Platform Collector Scheduler 3011 can reduce the frequency of collection for all resources uniformly, or it can skip monitoring for a random set of resources in each collection interval.
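By way of non-limiting illustration, the rate-limit handling described above can be sketched as follows. The function names, the multiplicative backoff factor, the recovery step back toward the desired interval, and the assumption of one API call per resource are choices made for this sketch and are not stated in the disclosure.

```python
import random


def next_interval(current_s: float, throttled: bool, desired_s: float,
                  backoff: float = 2.0, max_s: float = 3600.0) -> float:
    """Lengthen one customer's collection interval when throttled.

    Assumed policy: multiply the interval on throttling, and step back
    toward the desired interval once responses are no longer throttled.
    """
    if throttled:
        return min(current_s * backoff, max_s)
    return max(current_s / backoff, desired_s)


def uniform_interval(n_resources: int, calls_per_second: float, desired_s: float) -> float:
    """Option 1: slow collection uniformly so every resource fits the limit.

    Assumes one API call per resource per collection interval.
    """
    return max(desired_s, n_resources / calls_per_second)


def sample_resources(resources, calls_per_second, desired_s, rng=random):
    """Option 2: keep the desired interval but query a random subset."""
    budget = int(calls_per_second * desired_s)
    resources = list(resources)
    if len(resources) <= budget:
        return resources
    return rng.sample(resources, budget)
```

The uniform option keeps coverage complete at lower fidelity, while the sampling option keeps fidelity for the resources that are queried at the cost of gaps for the rest.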

FIG. 9 is a block diagram showing how the Data Gateway can be scaled to collect data from large numbers of data collectors, according to some embodiments. Specifically, FIG. 9 shows how Data Gateway 701 can be scaled out to enable the monitoring system to support high load (message rate). For example, where Data Gateway 701 had been previously described as a single module, Data Gateway 701 can also be configured as multiple software/hardware modules that work in parallel to handle more load, as indicated by the “stack” of Data Gateways 701 in FIG. 9. This figure shows internal details of a single Data Gateway. The Ingest API 800 can provide a web service to accept and validate the input messages. It can push the messages to a publish/subscribe service (Pub/Sub Message Layer 801), such as RabbitMQ. Several other components subscribe to messages published by the Ingest API.

Archiver 802 can subscribe to all messages from the collection API. The archiver lightly processes the messages, especially ensuring the integrity and ordering of the messages. The messages can then be stored in Metric Archive 403. Metric Archive 403 is assumed to be capable of handling the write throughput requirements of the monitoring system.

Indexer 803 also subscribes to all messages from the Ingest API. It divides those messages into several different types: inventory and resource metadata messages, measurement messages, and event messages. Metadata messages are forwarded to the Resource/Metadata Store 402. Measurement messages are stored in the Live Metric Database 401. Events are stored in the Event Store 406. Again, it is assumed that these data sinks can scale out to accommodate the load. Scale out NoSQL databases, like Cassandra, are examples of technologies that can handle the load for data measurements. Distributed search clusters, like ElasticSearch, are examples of technologies that can handle the load of metadata messages and event messages.
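By way of non-limiting illustration, the type-based dispatch performed by the Indexer can be sketched as follows. The `type` field, its values, the sink labels, and the handling of unrecognized messages are assumptions made for this sketch; the disclosure does not specify a message schema.

```python
# Hypothetical mapping from message type to the data sink named in the text.
SINK_FOR_TYPE = {
    "inventory": "resource_metadata_store_402",
    "metadata": "resource_metadata_store_402",
    "measurement": "live_metric_database_401",
    "event": "event_store_406",
}


def index_messages(messages):
    """Group messages by destination sink; unknown types are set aside."""
    sinks = {name: [] for name in set(SINK_FOR_TYPE.values())}
    rejected = []
    for msg in messages:
        sink = SINK_FOR_TYPE.get(msg.get("type"))
        if sink is None:
            rejected.append(msg)  # assumed behavior for unrecognized types
        else:
            sinks[sink].append(msg)
    return sinks, rejected
```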

Data Router 804 subscribes to measurement messages. It routes those messages to any Data Condition Detectors 5023 that need the data to evaluate conditions. This route could be based on, for example, a customer identifier.
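By way of non-limiting illustration, one way to route by customer identifier is to hash the identifier onto a fixed list of detectors, so that all measurements for a given customer reach the same detector. The function name, the detector labels, and the choice of hash partitioning are assumptions made for this sketch; the disclosure states only that the route could be based on a customer identifier.

```python
import hashlib


def pick_detector(customer_id: str, detectors: list) -> str:
    """Map a customer identifier to one detector deterministically."""
    # A cryptographic digest is used because it is stable across processes,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(detectors)
    return detectors[index]
```

With this scheme, changing the number of detectors remaps most customers; a consistent-hashing ring would reduce that churn if detectors hold per-customer state.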

Many other components in the monitoring infrastructure require the ability to scale out. This capability is often linked to the high-availability strategy. A group membership service, like Zookeeper, maintains the manifest of instances serving a given role. If instances are stateless, like the Cloud Collector (3014), the group membership service simply tracks the instances participating. Services that maintain state can also use the group membership service to describe how to route work to instances in the service.

Systems, methods, and non-transitory computer program products have been disclosed for monitoring and analyzing operation of a widely distributed service operated by an Infrastructure-as-a-Service (IaaS) tenant but deployed on a set of virtual resources controlled by an independent IaaS provider. The set of virtual resources provided to the IaaS tenant by the IaaS provider is selected by the IaaS provider and can change rapidly in both size and composition (i.e., the virtual resources are “ephemeral” from the perspective of the IaaS tenant). The monitoring system can monitor and determine relevant alerts based on system-level metrics collected directly from virtual resources with infrastructure metadata that characterizes the virtual resources collected from the IaaS provider to report on operation of the virtual resources. For example, the infrastructure metadata can contain a resource type, a resource role, an operational status, an outage history, or an expected termination schedule. The monitoring system can analyze the system-level metrics to report on operation of at least part of the set of virtual resources. The monitoring system can condition the reporting based on the infrastructure metadata to avoid inaccurate analysis.

The monitoring system can also automatically infer, without human modelling input or information regarding actual physical network connectivity, a service architecture of a widely distributed service operated by an IaaS tenant but deployed on a set of virtual resources controlled by an independent IaaS provider. Specifically, the monitoring system can automatically infer from the metadata and/or metric data how the virtual resources should be organized into groups, clusters and hierarchies. The present systems can automatically infer this service architecture using naming conventions, security rules, software types, deployment patterns, and other information gleaned from the metadata and/or metric data. The present systems can then run analytics based on this inferred service architecture to report on service operation.

The monitoring system can also rapidly update a dynamic service architecture of a widely distributed service operated by an IaaS tenant but deployed on a set of virtual resources controlled by an independent IaaS provider. For example, the monitoring system can infer from the infrastructure metadata and/or system-level metric data how the virtual resources should be organized into groups, clusters and hierarchies. The monitoring system can also update the dynamic service architecture frequently, to capture an expected rate of change of the resources, e.g., every five minutes.

The monitoring system can also evaluate performance of virtual resources deployed in a widely distributed service operated by an IaaS tenant but deployed on a set of virtual resources controlled by an independent IaaS provider, and infer that a virtual resource within the set of virtual resources may be hosted on at least one physical resource that is underperforming. Although the monitoring system may not have visibility into the composition, configuration, location, or any other information regarding the set of physical resources, the present system is able to evaluate the performance of the virtual resources and infer that a virtual resource within the set of virtual resources may be hosted on at least one physical resource that is underperforming.

The present system can also monitor and analyze cluster performance by detecting outliers in a widely distributed service operated by an IaaS tenant, but deployed on a set of virtual resources controlled by an independent IaaS provider. The set of virtual resources can be organized into clusters in which resources are expected to behave similarly to each other. Virtual resources that do not behave similarly to peer resources in the same cluster (i.e., outliers) may be indicative of problems that need to be addressed. The present system can collect performance metric data from virtual resources, and compare the performance of each virtual resource in a cluster with the performance of every other virtual resource in the cluster to detect outliers. This comparison can involve correlation analysis, ANOVA analysis, or regression analysis, as described earlier.
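By way of non-limiting illustration, a peer comparison within a cluster can be sketched as follows. This sketch does not implement the correlation, ANOVA, or regression analyses named above; it substitutes a simpler modified z-score based on the median and the median absolute deviation (MAD) of each resource's average metric over an interval. The function name, the threshold of 3.5, and the sample values (which could represent, for example, CPU steal percentages) are assumptions.

```python
from statistics import mean, median


def find_outliers(cluster_metrics: dict, threshold: float = 3.5) -> list:
    """Return ids of resources whose average metric deviates from peers.

    cluster_metrics maps a resource id to its samples over one interval.
    Uses the modified z-score (median / MAD), a stand-in for the analyses
    named in the text.
    """
    averages = {rid: mean(values) for rid, values in cluster_metrics.items()}
    center = median(averages.values())
    mad = median(abs(v - center) for v in averages.values())
    if mad == 0:
        # Deviations cannot be scaled when the MAD is zero; report none.
        return []
    return sorted(
        rid for rid, v in averages.items()
        if 0.6745 * abs(v - center) / mad > threshold
    )
```

For example, in a four-member cluster where three resources average roughly 2.5 and one averages 31, only the fourth is returned. A median-based score is used in this sketch because a single extreme peer does not distort the baseline it is compared against.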

Other embodiments are within the scope and spirit of the present systems and methods. For example, the functionality described above can be implemented using software, hardware, firmware, hardwiring, or combinations of any of these. One or more computer processors operating in accordance with instructions may implement the functions associated with semantically modelling and monitoring applications and software architecture hosted by an IaaS provider in accordance with the present disclosure as described above. If such is the case, it is within the scope of the present disclosure that such instructions may be stored on one or more non-transitory processor readable storage media (for example, a magnetic disk or other storage medium). For example, the ephemeral resources described above may be stored on non-transitory processor readable storage media under direct or indirect control of the IaaS provider. Additionally, as described earlier, modules implementing functions may also be physically located at various positions, including being distributed such that portions of functions are implemented at different physical locations.

The present disclosure is not to be limited in scope by the specific embodiments described herein. Indeed, other various embodiments of and modifications to the present disclosure, in addition to those described herein, will be apparent to those of ordinary skill in the art from the foregoing description and accompanying drawings. Thus, such other embodiments and modifications are intended to fall within the scope of the present disclosure. Further, although the present disclosure has been described herein in the context of a particular implementation in a particular environment for a particular purpose, those of ordinary skill in the art will recognize that its usefulness is not limited thereto and that the present disclosure may be beneficially implemented in any number of environments for any number of purposes.

What is claimed is:
 1. A system for determining that a virtual resource within a set of virtual resources may be hosted on at least one physical resource that is underperforming, the set of virtual resources being provided by an independent Infrastructure-as-a-Service (IaaS) provider to an IaaS tenant for operating a widely distributed service, wherein the virtual resources are hosted on a set of physical resources that may be at least one of geographically dispersed, part of different communication networks, and disjoint, wherein the IaaS provider is responsible for selection of the set of physical resources and an operational capacity of the set of virtual resources may change substantially and rapidly, and wherein the IaaS tenant has no direct control over and limited visibility into the selection of the set of physical resources, the system comprising: a data gateway configured to receive CPU utilization information related to the operation of the set of virtual resources; an analysis module configured to determine that a candidate virtual resource may be hosted on an at least one underperforming physical resource based on at least one of: (i) a comparison of CPU utilization of the candidate virtual resource with CPU utilization of other virtual resources within the set of virtual resources that are expected to perform similarly, (ii) a comparison of present CPU utilization of the candidate virtual resource with historical CPU utilization of the candidate virtual resource, and (iii) a comparison of CPU utilization of the candidate virtual resource with preconfigured thresholds.
 2. The system of claim 1, wherein the analysis module is further configured to suggest that the IaaS tenant terminate and relaunch the candidate virtual resource that is determined to be hosted on at least one underperforming physical resource so that the candidate virtual resource may be reassigned to another physical resource by the IaaS provider.
 3. The system of claim 1, wherein the CPU utilization information includes at least one of a CPU steal metric, a CPU utilization metric, and a CPU idle metric.
 4. The system of claim 3, wherein the comparison of CPU utilization of the candidate virtual resource with CPU utilization of other virtual resources that are expected to perform similarly includes a comparison of average CPU steal metrics during a predefined time interval.
 5. The system of claim 1, wherein the preconfigured thresholds are based on an expected level of performance for the resources provided by the IaaS provider to the IaaS tenant.
 6. The system of claim 1, further comprising an infrastructure platform collector configured to query Application Program Interfaces (APIs) defined by the IaaS provider to collect infrastructure metadata characterizing the set of resources, and to detect when queries to APIs result in an error condition, wherein the analysis module is configured to determine that a candidate virtual resource may be hosted on an at least one underperforming physical resource based on the error condition.
 7. The system of claim 6, wherein the error condition includes at least one of an error code, incorrect metadata, and a delayed response.